Text File | 1992-04-17 | 710.6 KB | 22,872 lines
812.
Date: Mon, 1 Jan 90 13:52:52 PST
From: tve (Thorsten von Eicken)
Subject: migrating a process group

I'm starting to use migration for things other than pmake and I regularly would like to migrate a "program" which consists of a number of processes. For example, a shell script which invokes many awk scripts and C programs. Now, when I "mig -p" the top shell process, the migration has no effect until the child processes (which do the real work) terminate and the shell script starts new ones. If I migrate the children, then new children will reappear on my machine. Migrating all processes requires a large number of idle hosts since each process must be migrated separately.

What I would like is to be able to either specify more than one process per mig command or to be able to specify a parent process and have it migrated with all its children. Note that in both cases all processes should be migrated onto the same host. (I would prefer the "migrate parent+children" option.) Any comments? Difficult to do?

813.
Date: Mon, 1 Jan 90 18:54:55 PST
From: tve (Thorsten von Eicken)
Subject: re-migration doesn't work

Once a process is migrated to a host, it's pretty much stuck there. Children of this process are also stuck there. Examples:

1) I migrated a process. This process forks a few times. Now that host is overloaded. I have no way of redistributing the processes, except for evicting all processes from that host and then migrating the processes again.

2) I migrated a shell script. This shell script starts lots of processes, and asks mig to migrate each of them. Mig returns "Error execing program: invalid argument" and the process is NOT executed.

In other words: you are allowed to migrate once and only once. If you try again, you die! It seems that the migration mechanism should be more general: "shove this process onto an idle host different from the one it's on!".
I don't know if it's the mig command which is too simple-minded or if the kernel just doesn't allow processes to be pushed around arbitrarily.

814.
Date: Tue, 2 Jan 90 01:20:43 PST
From: tve (Thorsten von Eicken)
Subject: manual page for ggraph hopelessly out of date

some commands listed don't exist, others exist but aren't listed. Is anyone maintaining this program?

815.
Date: Tue, 2 Jan 90 03:30:26 PST
From: tve (Thorsten von Eicken)
Subject: more compiler problems on sun4

Well, the new version of xgraph I just got doesn't compile either, but it's a different story this time. When compiling on the sun4, a lot of symbols are missing when the final linking is done. When compiling on a sun3 for a sun4, everything works fine. Needless to say, everything is fine when compiling on sun3 for sun3 and on ds3100s. I think I saw this kind of bug a few weeks ago and reported it. I think it was supposed to be fixed.

To reproduce:

    cd /X11R3/src/cmds/xgraph
    pmake clean
    pmake

(all on a sun4)

816.
Date: Tue, 2 Jan 90 10:43:05 PST
From: brent (Brent Welch)
Subject: mail duplication

Is this a known bug? Scenario: reading lots of mail, and during this time a new message arrives, plus I send a new message to spriters as about the last thing I do in the mail session. I quit mail and get "new mail has arrived". I immediately go back into mail and see a new message (not the one I sent), which I read and delete. I exit mail a second time and get another "new mail has arrived" message. When I reenter mail the message which I read and deleted last time has reappeared, along with the message I sent myself in the first mail session. This has happened on more than one occasion. Is this some flaw in the mail spool file locking protocol? Anybody know what that locking strategy is?

817.
Date: Wed, 3 Jan 90 09:11:01 PST
From: ouster (John Ousterhout)
Subject: Trashed file

I found another trashed file today: /sprite/lib/forms/cmd.man. I moved the file to /sprite/trashed/sprite-lib-forms-cmd.man.
Since this was an RCS-ed file I checked out a new version and diff-ed them to find out exactly what had changed. The first 1024 bytes of the trashed file were the same as the original, but everything after that was different, apparently consisting of a piece of someone's bibliography database. Interestingly, the total length of the file was unchanged by the substitution. Brent, could your recently-fixed cache bug explain this? The trashing could have occurred a long time ago.

818.
Date: Thu, 4 Jan 90 10:47:45 PST
From: brent (Brent Welch)
Subject: gremlin is broken

Gremlin behaves terribly for me running on a Sun3 under X11R3. Whatever gremlin does when it flashes the selection and moves (or copies) the selection is busted so that the screen gets more and more trashed as you work. Occasionally, for example, the mouse cursor gets caught by a flashing selection and remains behind. I'm pretty sure this is a server problem because an old (R2) version of gremlin exhibits the same behavior.

819.
Date: Fri, 05 Jan 90 14:38:02 PST
From: rab (Robert A. Bruce)
Subject: Re: gcc bug

The bug that JohnH reported about compiling perl is a byte-order bug that occurs when a procedure that passes a double is compiled for a sun3 on a ds3100.

    main() { foo(1.0); }

    --------------------- Compiled on a ds3100 --------------------------
    _main:
        link a6,#0
        movel #1072693248,sp@-    <--- These two instructions
        clrl sp@-                 <--- are in the wrong order.
        jbsr _foo
        unlk a6
        rts
    -----------------------------------------------------------------------

Until this gets fixed, don't compile sun3 floating point stuff on the decStations.

820.
Date: Sun, 7 Jan 90 20:01:59 PST
From: tve (Thorsten von Eicken)
Subject: on the ds3100, gcc and cc have different struct return conventions

Example:

_ file foo.c

    struct goo {int i,j; };
    extern struct goo hoo(int);
    main() {
        struct goo l;
        l = hoo(3);
        printf("result: %d %d (=?= 3 4)\n", l.i, l.j);
    }

_ file hoo.c

    struct goo { int i,j; };
    struct goo hoo(int k) {
        struct goo l;
        l.i = k;
        l.j = k+1;
        return l;
    }

_

Now compile one with gcc, the other with cc, link together and watch a "Bad user TLB fault" come up. If you look at the assembly output, it is evident that things can't work together. Is there a magic flag to gcc to convince it to use the same convention as cc?

821.
Date: Sun, 7 Jan 90 21:08:44 PST
From: eklee (Edward K. Lee)
Subject: lost files

Yesterday (Saturday), I created a directory named ~eklee/mult to compile a modified version of Pete's mult program. I was executing ~eklee/mult/sun4.md/mult from raid when raid crashed (a bug in my kernel). I then discovered that everything, including all subdirectories (even . and ..), in the mult directory was missing. (I'm not sure if the two events are related.) Afterwards, I was unable to create files in the mult directory. I only lost a small amount of work which I've since recreated. Here's a transcript which illustrates the strange behavior.

    ----------------------------------------
    forgery% ls -a mult
    total 0
    forgery% mkdir jnk
    forgery% ls -a jnk
    total 4
    1 ./ 3 ../
    forgery% find mult -name a
    find: bad directory <mult>
    forgery% cat > mult/t
    hello <>
    forgery% ls -a mult
    total 0
    forgery%
    ----------------------------------------

822.
Date: Mon, 8 Jan 90 11:52:11 PST
From: brent (Brent Welch)
Subject: mkscsidev

I used the mkscsidev script to create the device file for the exabyte on Allspice. Two things. First, I still haven't found the magic place where the HBA type number is defined. I reverse engineered another device file to determine that the SCSI3 is probably type 0.
Second, the device type of the device file created corresponded to a disk (4) not a tape (5). The unit number was correct for HBA #3, Target 5, but I had to generate the device file by hand in order to get the right device type.

823.
Date: Tue, 09 Jan 90 23:00:13 PST
From: Fred Douglis <douglis>
Subject: tx infinite loop on data file

I inadvertently tried to cat a non-ascii file and tx went into an infinite loop on me. Try catting ~douglis/test-tx. I think it may have something to do with the fact that the line is very long and tx may think it's some sort of command.

Fred

824.
Date: Wed, 10 Jan 90 13:04:06 PST
From: brent (Brent Welch)
Subject: bad sun4 Exabyte driver?

Putting the Exabyte on SCSI#0 doesn't affect its behavior. I was again able to write a tar tape, but when reading it back I got a SCSI select failure after about 100K. The tar file was about 4Megs, and I was able to read it on Murder's exabyte ok. I suspect a timing problem with the Sun4 version of the driver. We may have to add some select retries in case the tape drive is getting into some funny state where it takes a long time to respond.

825.
Date: Wed, 10 Jan 90 16:17:11 PST
From: Fred Douglis <douglis>
Subject: trashed file

/c/stats/sloth/6Jul should be a directory; instead ls claims it's a socket and update doesn't know how to do anything with it.

826.
Date: Fri, 12 Jan 90 10:27:42 PST
From: brent (Brent Welch)
Subject: Sequent sun 3/50

The latest kernel (1.051) still breaks on the 3/50 up at Sequent. The system runs longer than it used to, but eventually it freaks out and there is apparently bad data in the cache. The symptom is that execs() fail because of a bad a.out header. I've told fubar to try fixing the size of the file system cache to see if it behaves better. I think there is either another hideous bug in the cache, or (I hope) something about the 3/50 architecture that we aren't taking into account.
I think we should get our hands on a 3/50 so we can do some testing here in a controlled situation - let's make this an agenda item.

827.
Date: Fri, 12 Jan 90 11:49:22 PST
From: ouster (John Ousterhout)
Subject: Bad magic number

When I remade the mx library for the Sun4, using a DS3100 for the compilation, I got a "bad magic number" error when I tried to use the resulting .a file in a link (where the link was also run on a DS3100, using gcc). Here's some sample output:

    piracy: pmake install TM=sun4
    --- sun4.md/mx.o ---
    rm -f sun4.md/mx.o
    gcc -g -O -msun4 -Dsprite -Dsun4 -I. -Isun4.md -c mx.c -o sun4.md/mx.o
    --- sun4.md/mx ---
    rm -f sun4.md/mx
    gcc -g -O -msun4 -Dsprite -Dsun4 -I. -Isun4.md -o sun4.md/mx sun4.md/mx.o -lmx_g -lsx_g -lcmd -ltcl -lX11
    ld: Bad magic number in /sprite/lib/sun4.md/libmx_g.a(gth:1)
    *** Error code 1

828.
Date: Fri, 12 Jan 90 11:53:51 PST
From: ouster (John Ousterhout)
Subject: Re: tx infinite loop on data file

Fred reported the following problem:

    I inadvertently tried to cat a non-ascii file and tx went into an infinite loop on me. Try catting ~douglis/test-tx. I think it may have something to do with the fact that the line is very long and tx may think it's some sort of command.

I've fixed this bug and I'm currently installing new versions of Mx and Tx. I believe that this same bug is responsible for a similar infinite loop that jhh reported a while ago. Thanks for the repeatable trigger, Fred.

829.
Date: Fri, 12 Jan 90 16:46:43 PST
From: Fred Douglis <douglis>
Subject: pdev man page out of date

it refers to byteOrder as a field in an ioctl struct, but in fact it's "format". I'll change that one mistake, but it suggests there might be other outdated fields or parameters mentioned in the man page that would bear reexamination.

830.
Date: Fri, 12 Jan 90 17:36:30 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: unix copy of sources wrong

The directory /sprite3/src/kernel/libc on unix does not contain any source files.

831.
Date: Sun, 14 Jan 90 11:45:29 PST
From: tve (Thorsten von Eicken)
Subject: /sprite/admin/userLog is not a multiple of 81 bytes

I keep getting that whenever I do a finger. Sounds like something is broken!? It started late yesterday evening.

832.
Date: Sun, 14 Jan 90 16:37:05 PST
From: tve (Thorsten von Eicken)
Subject: spritemon -v seems to invent memory

On crackle, a sun4 with 12Megs of physical memory, if I run spritemon -vM -fM I am usually in a state with 1/4 to 1/3 of a Meg devoted to the fs cache and 12 megs devoted to VM. Where does the extra physical memory come from? Not to mention the memory used by the kernel...

833.
Date: Sun, 14 Jan 90 16:54:03 PST
From: tve (Thorsten von Eicken)
Subject: Re: curious spritemon values for -f and -v

If I start a spritemon -f% -v%, I get about 50% user VM size. Does this mean there is a problem in deciding what the size of a "thing" returned by Fs_Cmd and Vm_Cmd is? I saw that there is no machine dependent code of that sort in spritemon.

834.
Date: Sun, 14 Jan 90 22:35:19 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: xproof on a ds3100

Xproof on a monochrome ds3100 always puts my X server into the debugger. For an example type "xproof troff.output" in ~jhh/proj/paper.

835.
Date: Sun, 14 Jan 90 23:13:23 PST
From: tve (Thorsten von Eicken)
Subject: vmcmd -R 1 creates havoc on sun4c

This enables the "use fs read-ahead" in VM. Of course I used a fscmd -r 1 command before. Why is this not the default anyway?

836.
Date: Mon, 15 Jan 90 11:33:12 PST
From: brent (Brent Welch)
Subject: Re: vmcmd -R 1 creates havoc on sun4c

file system read ahead is not ordinarily enabled because it doesn't (or didn't) provide much of a win. We suffer a context switch to a different process in the implementation of read ahead, and we were never getting more than one block per disk revolution anyway. So much for background. The VM system either calls Fs_PageRead or Fs_Read to fill in a page.
The former has smarts about the cache and remote swap files - it tries to avoid simply moving a VM page into the FS cache. The latter is used in an attempt to gain something from FS read ahead. If this breaks (can you describe exactly what the "havoc" is?) we should fix it.

brent

ps. So if you "vmcmd -R 1" you change the system from using Fs_PageRead to Fs_Read; the call is made from VmFileServerRead.

837.
Date: Mon, 15 Jan 90 11:45:17 PST
From: brent (Brent Welch)
Subject: spritemon -v

The spritemon program only computes megabytes right on hosts with 8K page sizes, oops. I'll fix it to pay attention to pagesize.

838.
Date: Mon, 15 Jan 90 13:53:52 PST
From: tve (Thorsten von Eicken)
Subject: intricate problem with filesystem read-ahead and migration

I have the following scenario:

Machines: crackle and burble

Command: pmake dependall (with sun3, sun4, ds3100, about 20 source files)

Symptom:

    --- dependsun3 ---
    cannot read all of apGenerate.c

When: if I turn file system read-ahead on on burble (fscmd -r 1) then the pmake depend which gets migrated to burble hits this mysterious error. If I say "fscmd -r 0" on burble, then "pmake dependall" on crackle, everything is fine. Then I immediately say "fscmd -r 1" on burble and "pmake dependall" and I get this error reliably. I can switch back and forth between the two states.

Where: the directory is ~octtools/src/lib/ace, but I don't really think it matters (except that there are quite a lot of files, and many more are included, so the fs cache gets to work hard).

839.
Date: Mon, 15 Jan 90 18:50:54 PST
From: tve (Thorsten von Eicken)
Subject: mkmf.md doesn't allow source files to start with digits

I fixed it to allow that since I don't see any reason to disallow it. If there are objections, please back out and tell me why.

840.
Date: Mon, 15 Jan 90 19:26:46 PST
From: tve (Thorsten von Eicken)
Subject: sun4 c compiler generates illegal assembly output

This results in a "/tmp/cc276319.s:107:Illegal operands" style message.
The culprit line looks like this:

    ldd [%lo(_AceG+112)+%g1],%f0

by changing it to

    ldd [%g1+%lo(_AceG+112)],%f0

everything is fine. To test:

    cd ~octtools/src/lib/ace; cc '-DCADROOT=\"/users/octtools\"' -O -msun4 -Dsprite -Dsun4 -I. -Isun4.md -I/users/octtools/lib/include -c genMove.c

841.
Date: Tue, 16 Jan 90 10:47:21 PST
From: brent (Brent Welch)
Subject: 3/50 dma bug

After some further investigation I have determined that the first 16 bits of the cache block are corrupted after a Disk read. This doesn't happen on every read. I don't know the SCSI code well enough to debug the DMA system. There is code in the SCSI driver that is #ifdef'd out that fishes the last short-word from the DMA fifo. I can't see how this would mess up the beginning of a buffer, unless the left-over lingers around til the next time? Who knows. I have to stop working on this, but I would really appreciate it if someone else would carry the ball. (hint hint)

842.
Date: Tue, 16 Jan 90 14:39:29 PST
From: david@rosemary.Berkeley.EDU (David A. Wood)
Subject: rcp and ftp

Don't seem to work for transferring files from Sprite to UNIX. I've tried all combinations (initiate on Unix, initiate on Sprite) and it rarely transfers more than 100K bytes (although one try did transfer 700K). On the UNIX side for rcp I never get an error message. Sprite rcp says something like 'lost connection'. Unix side ftp says 'netin:connection reset by peer'. Sprite side ftp says 'netout: broken pipe'.

843.
Date: Tue, 16 Jan 90 15:10:12 PST
From: tve (Thorsten von Eicken)
Subject: problems when making libraries (timestamp ?)

When I "pmake install" a library, I often get the following scenario:

_ One or two source files are out of date. They get recompiled.

_ The "ar" command which adds the new .o files to the .a file adds the recompiled .o file *plus* a few additional .o files which were not out of date and not recompiled. These .o files of course don't exist and ar prints out error messages.
I suspect that this is due to unsynchronized clocks where the .o file is more recent than the corresponding .c file (correct) but also more recent than the .a file (incorrect) because things were done on different machines (migration). Would running rdate more often than daily alleviate this? Why is there no timed running in sprite? Here's an example of the above problem:

    [crackle rpc] pmake install
    pmake: Lockfile owned by you -- ignoring it
    --- installman ---
    No man pages for library rpc? Please write some.
    --- sun4.md/appTemplate.o ---
    rm -f sun4.md/appTemplate.o
    cc "-DCADROOT=\"/users/octtools\"" "-DCUR_DATE=\"`date | awk '{print %2, %3, %6}'`\"" "-DCUR_TIME=\"`date | awk '{print %4}'`\"" -O -msun4 -Dsprite -Dsun4 -I. -Isun4.md -I/users/octtools/lib/include -c appTemplate.c -o sun4.md/appTemplate.o
    --- sun4.md/librpc.a ---
    ar r sun4.md/librpc.a sun4.md/appReg.o sun4.md/appTemplate.o sun4.md/rpc.o
    ar: cannot open sun4.md/appReg.o
    ar: cannot open sun4.md/rpc.o
    /sprite/cmds.sun4/ranlib sun4.md/librpc.a
    rm -rf sun4.md/appDM.o sun4.md/appInit.o sun4.md/appNet.o sun4.md/appOct.o sun4.md/appRPC.o sun4.md/appReg.o sun4.md/appTemplate.o sun4.md/appVem.o sun4.md/rpc.o
    --- /users/octtools/lib/sun4.md/librpc.a ---
    rm -f /users/octtools/lib/sun4.md/librpc.a
    /sprite/cmds.sun4/cp sun4.md/librpc.a /users/octtools/lib/sun4.md/librpc.a
    /sprite/cmds.sun4/ranlib /users/octtools/lib/sun4.md/librpc.a

844.
Date: Tue, 16 Jan 90 15:15:41 PST
From: eklee (Edward K. Lee)
Subject: xbiff dies on ds3100

Xbiff dies every 12 to 24 hours on ds3100's (forgery).

Ed

    ----
    forgery% xbiff -B &
    [2] e2b32
    forgery% XIO: invalid argument

845.
Date: Wed, 17 Jan 90 19:49:28 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: distribution bug

There are several soft links with absolute pathnames in the distribution root directory that really should be relative pathnames. The link from boot/cmds -> /sprite/cmds doesn't work so well if I have the disk mounted as /t6.
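The absolute-versus-relative distinction is easy to reproduce on a present-day POSIX system. What follows is only a minimal sketch in Python, not anything from the Sprite distribution: the temporary directories "root" and "t6" are stand-ins for the original mount point and the /t6 mount, and the function name and link names are made up for illustration.

```python
import os
import shutil
import tempfile

def links_after_relocation():
    """Build a tiny tree with two symlinks, move it, and report which still resolve."""
    base = tempfile.mkdtemp()
    old_root = os.path.join(base, "root")   # stand-in for the original "/"
    new_root = os.path.join(base, "t6")     # stand-in for mounting the disk as "/t6"
    os.makedirs(os.path.join(old_root, "sprite", "cmds"))
    os.makedirs(os.path.join(old_root, "boot"))

    # Absolute target: only valid while the tree lives at old_root.
    os.symlink(os.path.join(old_root, "sprite", "cmds"),
               os.path.join(old_root, "boot", "cmds_abs"))
    # Relative target: resolved starting from the directory that holds the link.
    os.symlink(os.path.join("..", "sprite", "cmds"),
               os.path.join(old_root, "boot", "cmds_rel"))

    os.rename(old_root, new_root)            # the tree now lives somewhere else

    # os.path.isdir follows symlinks, so a dangling link reports False.
    outcome = {
        "absolute": os.path.isdir(os.path.join(new_root, "boot", "cmds_abs")),
        "relative": os.path.isdir(os.path.join(new_root, "boot", "cmds_rel")),
    }
    shutil.rmtree(base)
    return outcome

result = links_after_relocation()
print(result)
```

After the move, the absolute link still names the old location and dangles, while the relative link keeps resolving because its target is looked up from the link's own directory.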
I ended up copying a test version of initsprite to /sprite/cmds. I think that only the links in the boot directory matter, but there may be others waiting to bite people.

846.
Date: Thu, 18 Jan 90 11:58:02 PST
From: brent (Brent Welch)
Subject: nfsmount /spur is dying

nfsmount is repeatably failing on /spur. The first attempt to open a file relative to a working directory under /spur fails. I don't have time to debug nfsmount right now, perhaps I will later this afternoon. I can't tell much from gdb because it is apparently in Sig_Send() after getting some random signal. I have to step through this. In the meantime /spur is unavailable via nfsmount (someone could try mounting it on assault), and NFS access to other systems might also be flaky.

847.
Date: Thu, 18 Jan 90 14:40:20 PST
From: brent (Brent Welch)
Subject: /spur is back

The bottom line is that /spur is back, and the other NFS partitions seem ok. However, Oregano was definitely in a weird state, but I would have had to run the kernel debugger to figure it out. When I went to debug nfsmount, for example, I found that I could no longer even start up the pseudo-file-system. Some piece of internal state (maybe even on mint, which stores the /spur remote link) was goofed up. I patched around this situation by recreating the /spur remote link under a different inode number. This magic ensures that the state associated with the old remote link doesn't get in the way. Anyway, I was able to restart the nfsmount process and I was not able to crash it. This is basically an unresolved pseudo-file-system weirdness.

848.
Date: Thu, 18 Jan 90 16:03:18 PST
From: elm (ethan miller)
Subject: bus errors

I get the following bus errors occasionally when I fork off a new shell (doesn't matter which kind):

    MachPageFault: Bus error in user proc 93e20, PC = 12100, addr = 80a26020 BR Reg 80

These bugs occur quite often (once every 10 process forks).
In addition, these processes refuse to die even when I use killdebug or kill -9 on them. This is on a color SparcStation (terrorism).

849.
Date: Thu, 18 Jan 90 16:15:56 PST
From: tve (Thorsten von Eicken)
Subject: sun4 cc gets bus error when run on sun4, all ok when on sun3

Try the following few lines:

    /* contour.c - contour plotting program
     *
     * Copyright (C) 1990 by the regents of the University of California
     * Author - Thorsten von Eicken
     *
     */
    double *data;
    static void cont(double clo)
    {
        int x;
        double d0, d1;
        d1 = data[x];
        if((d0 < clo) != (d1 < clo)) { }
    }

850.
Date: Thu, 18 Jan 90 16:22:50 PST
From: eklee (Edward K. Lee)
Subject: cc hangs on ds3100

    forgery% pwd
    /users/eklee/combin
    forgery% cc -g -L../sim/ds3100.md -g3 -O -Dds3100 -Dsprite -Uultrix -I/users/eklee/lib/include -I. -Ids3100.md -I/sprite/lib/include -I/sprite/lib/include/ds3100.md -I../sim -I../sim/ds3100.md -c combin.c -o ds3100.md/combin.o

    <in another window>

    forgery% ps -au
    USER PID %CPU %MEM SIZE RSS STATE TIME PR COMMAND
    eklee c2b42 81.4 2.4 1728 580 READY 0:17 ugen -G 8 -EL -g3 -O2 ... ...

Cc works for other files but hangs on combin.c. The problem seems to have been caused by declaring a structure as a formal parameter. When I changed the structure to a pointer to a structure, the problem went away.

851.
Date: Thu, 18 Jan 90 20:11:53 PST
From: tve (Thorsten von Eicken)
Subject: pmake gives "Error code 16"

I'm running lots of simulations using pmake and every 20 minutes or so (or in other terms, about every 70 processes) pmake dies with something like:

    --- seqs/S-s1-t2.62 ---
    *** Error code 16
    pmake: 1 error

Now the simulator doesn't return errors, so what does this error code mean? I tried to RTFM but didn't find anything.

-Thorsten

(I'm gonna use the -k flag now...)

852.
Date: Fri, 19 Jan 90 14:34:41 PST
From: brent (Brent Welch)
Subject: 3/50 disk status

Well, the good news is that I got my 3/50 to boot off of its disk, finally.
The bad news is that my one hunch about the failing DMA didn't pan out. I still get some random data returned, even after adding code to deal with the leftover half-word in the DMA fifo. This funny situation turns up, but it doesn't coincide with the bad data I get. So, I'm still stumped, and I can't say how much energy I personally have to work on this stupid driver.

853.
Date: Mon, 22 Jan 90 09:11:23 PST
From: Fred Douglis <douglis>
Subject: ds3100 out of memory, wouldn't enter debugger

kvetching died while i was gone. it said it was out of memory. however, debugging it from dill got timeouts. anyone know of something running on kvetching at the time, or why it might not have entered the debugger properly? (it had the normal msg about entering, so i don't know what's up. the net is fine since i was able to start it rebooting.)

854.
Date: Mon, 22 Jan 90 10:23:35 PST
From: Fred Douglis <douglis>
Subject: Re: problems when making libraries (timestamp ?)

FYI, the following is in crontab:

    #
    # Synchronize your watches
    #
    0 4 * * * root /sprite/admin/Rdate >& /dev/syslog

Perhaps this isn't being run properly, but there is certainly an attempt to synchronize clocks once per day. Running timed would be nice; as I recall, porting it was non-trivial and we decided against it the last time the issue came up. Maybe it was just that we were missing some routines that we only had vax sources for, or something like that... perhaps adding timed could become a spring cleaning item? I agree with you, we really can and should synchronize more closely.

855.
Date: Mon, 22 Jan 90 10:28:20 PST
From: Fred Douglis <douglis>
Subject: Re: xbiff dies on ds3100

which version of the X server are you running? I found that was true with the original servers but not X?fb.new. It's not xbiff's fault, it's the server's.

856.
Date: Wed, 24 Jan 90 10:50:24 PST
From: ouster (John Ousterhout)
Subject: Thank goodness for rcsid's

If I've ever said anything nasty about rcsid lines in the past, I take it back. To track down the ipServer problem, I ran strings on good and bad binaries to extract all the rcsid strings. Then I methodically started restoring versions back to what they were in the last known-good binary. I noticed that the file Net_InetHdrChecksum.c (in the "net" library) had been modified by jhh to "Allow buffers to be odd-aligned". When I backed out this file to its previous version, suddenly the ipServer started working perfectly. I have to head to USENIX so I haven't figured out WHY the changes broke it (John, if you get a chance you might take a look).

In the meantime, I'm going to leave installed what I think is an OK ipServer. This is only for Sun-3's. If it turns out to be buggy, you can overwrite the installed version with "ipServer.ok" from the ipServer source directory. Without rcsid's I don't think I would ever have thought to check in the net library.

I wonder if perhaps the versions for ds3100 and sun4 have always been compiled with the new version of Net_InetHdrChecksum (perhaps they have to be?) and that's why they've never worked?

857.
Date: Tue, 23 Jan 90 11:22:12 PST
From: root (The Sprite God)
Subject: locked mail spool file

Somehow Ann Chervenak's mail spool file got locked up for a few days. This was an flock() advisory lock, and I was able to run an 'unlock' program (~brent/tmp/unlock.c) to clear the lock. However, this implies that lock recovery is somehow broken. "I'm sure I tested this", but clearly there is some path which can leave a file locked. Supposedly locks are released if the client that holds them goes down. I just rechecked the code and it seems ok. Someone should spend some time pounding on this to see if there isn't a repeatable bug that we can fix.

858.
Date: Tue, 23 Jan 90 11:37:20 PST
From: Fred Douglis <douglis>
Subject: Re: mail snafu

Thanks for the info. [Bob and Ann can probably stop reading right here. :) ]

Actually, I think the problem is due to a process on Ann's machine that is in the debugger. If her mail process went into the debugger with the lock held and was never killed, that could cause problems. This suggests that we need to either

(1) get rid of "debuggable" processes entirely (back to core files),

(2) steal back locks from processes when they enter the debugger,

(3) make programs like mail more robust so they catch the debug signal and exit, which will work for mail but break the next time a similar problem occurs, or

(4) make programs like sendmail more robust so they won't wait forever for a file to be unlocked.

859.
Date: Tue, 23 Jan 90 11:40:53 PST
From: gibson (Garth Gibson)
Subject: crashed machines

Peter has a simulation program that crashes 3100s from time to time; that is, we think it does. When the machine has crashed it is left in the blacked out state with the only record of the problem hidden under the blackout. It may be the floating point exception in the kernel problem. The bug is the lack of any record of syslog or screen messages.

860.
Date: Tue, 23 Jan 90 13:56:06 PST
From: ouster (John Ousterhout)
Subject: Bad ipServer

I noticed today that I couldn't ftp large files from Sprite to DEC, and that I also cannot even rcp large files from Sprite to Rosemary. I also noticed that a new ipServer was installed in the last month, and that the previous installation before that was last August. I backed out the ipServer to the version of last August and both the ftp and the rcp worked fine. I'll try to track down which of the zillions of changes since last August is responsible for the problem, but in the meantime I've backed out the ipServers for ds3100, sun3, and sun4. Perhaps this will get rid of the problems people have had copying to and from Sprite?

861.
Date: Tue, 23 Jan 90 15:30:06 PST
From: culler (David Culler)
Subject: Emacs, X and other evils

I've been trying to use Sun-3s in the cory Bard cluster and have run into a variety of problems. If I try to run EMACS from a tx window a lot of things are screwed up. I have set the termcap using the control menu. Emacs does come up, but a lot of the key strokes do not work, such as ctrl-a. Display is messed up too. "Don't do it!" you exclaim. I agree, so I set the display variable. Well, that requires X access permission on this end and xhost does not work. Any other suggestions? I'd fire up an xterm from this end, but that doesn't work.

862.
Date: Tue, 23 Jan 90 15:35:50 PST
From: tve (Thorsten von Eicken)
Subject: Re: Emacs, X and other evils

If you want to go for broke: rlogin to rosemary (or other favorite unix box), start an xterm -display cardamom and then rlogin to bard in that. (Oh, it works with rosemary because it's in /etc/hosts.equiv).

Real solutions:

1) bring X11R4 up on the ds3100 (hahaha)

2) add bard to /etc/hosts.equiv (why not? security?)

3) port xterm to sprite (hehe)

4) marry tx & emacs

5) run the old X server which accepts xhost (try "xinit -- /ultrix/cmds.ds310/Xmfb")

NB: reminder: xhost doesn't work because the X server on the ds3100 dies if it's run. The problem is in the server (probably the authorization stuff is different in sprite than in ultrix).

863.
Date: Tue, 23 Jan 90 15:38:05 PST
From: tve (Thorsten von Eicken)
Subject: new error message in syslog on sun4

    FPU exception from process without MACH_FPU_ACTIVE, fsr = 0x68000
    FPU exception from process without MACH_FPU_ACTIVE, fsr = 0x68020

hey, never seen that before. Any takers? Dunno what process, dunno what was running...

864.
Date: Tue, 23 Jan 90 22:18:16 PST
From: douglis@rosemary.Berkeley.EDU (Fred Douglis)
Subject: xkill on ds3100 Xmfb.new hangs server

run locally on my ds3100, was a no-op.
run from rosemary, i got the xkill cursor and then my entire window system froze up as if it were waiting for a keypress but not getting one. killing the xkill process didn't help and i had to restart my window system.

865.
Date: Wed, 24 Jan 90 11:23:09 PST
From: brent (Brent Welch)
Subject: Fs_Select and RPC_TIMEOUT

I was looking at the fs code to see what happens when you select on a remote device and its I/O server is down. Currently, RPC_TIMEOUT will be returned and the corresponding mask bit won't be cleared. This seems wrong. Either the mask bits should be cleared (I guess this won't matter), or the RPC_TIMEOUT should be hidden. I think that perhaps the RPC_TIMEOUT should be hidden. This will cause the process to wait in Fs_Select until its timeout period expires, or until another stream becomes ready, or until after the recovery protocol completes. In the current implementation the RPC_TIMEOUT return from Fs_Select will give the application no help in determining what stream failed. Any strong opinions?

866.
Date: Wed, 24 Jan 90 14:11:11 PST
From: culler (David Culler)
Subject: Objectable Ultrix Objects

Any thoughts on this one? I have brought over an Ultrix Ds3100 binary for allegro common-lisp. When I try to run it, it acts like it isn't an executable file, but FILE thinks it is. Here's dribble:

    [cardamom]/users/culler/cl/bin (7)% cl
    cl: syntax error at line 2: `(' unexpected
    [cardamom]/users/culler/cl/bin (8)% file cl
    cl: mipsel 407 executable not stripped - version 1.31
    [cardamom]/users/culler/cl/bin (9)%

867.
Date: Fri, 26 Jan 90 09:04:33 PST
From: ouster (John Ousterhout)
Subject: Re: Fs_Select and RPC_TIMEOUT

You asked what to do when someone selects on a remote device and its I/O server is down. To resolve this, I'd suggest mimicking what happens in UNIX when you select on a network connection that has been closed from the other end (or try selecting on a UNIX tape drive that is off-line).
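For comparison, the closed-connection case can be checked directly on a current POSIX system. This is a minimal sketch in Python, not Sprite code and not a statement about what 1990-era UNIX kernels did: it assumes a local socket pair is an acceptable stand-in for a network connection, and the function name is made up for illustration.

```python
import select
import socket

def poll_after_peer_close():
    """Close one end of a connected pair and see how select() reports the other end."""
    ours, theirs = socket.socketpair()
    theirs.close()                          # the "other end" goes away

    # Ask about readability and exceptional conditions, waiting at most one second.
    readable, _, exceptional = select.select([ours], [], [ours], 1.0)

    # If select says the socket is readable, a read shows what is actually there.
    data = ours.recv(16) if ours in readable else None
    ours.close()
    return (ours in readable, data)

is_readable, data = poll_after_peer_close()
print(is_readable, data)
```

On systems that follow POSIX here, the surviving end is reported as readable and the subsequent read returns zero bytes (end-of-file), so the failure surfaces on the read rather than as an error from select itself.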
I suspect that either the "exception" condition is set, or else the device is considered to be both readable and writable and then when you try to read or write it an error gets returned. Unless UNIX is totally brain-damaged, I think the most important thing is to do what it does. 868. Date: Fri, 26 Jan 90 09:12:51 PST From: sullivan (Mark Sullivan) Subject: rcp error I was rcp'ing my .login from postgres to babylon. From postgres "rcp .login babylon:.login" worked fine. From Babylon, "rcp postgres:.login .login" tells me: rcp: protocol screwup: mtime.sec not delimited 869. Date: Mon, 29 Jan 90 08:21:48 PST From: rab (Robert A. Bruce) Subject: mach header files The header files mach/*.md/compatSig.h and mach/*.md/compatInt.h are symbolic links to /sprite/src/lib/c/unixSyscall/compat???.h. I don't think this is a good thing. If these files are used by external routines, then they should be installed in the standard include directories. 870. Date: Mon, 29 Jan 90 10:47:25 PST From: Fred Douglis <douglis> Subject: /sprite/admin/hosts is a handy file but is terribly out of date (last updated 12/15). the line in howto/addNewHost mentions this file, but i gather that the new script doesn't do anything about updating it. can the recent additions be added to this file? i use it sometimes when rup indicates that a machine is down and i'm wondering if it was a temporary machine or is something potentially worth debugging. 871. Date: Mon, 29 Jan 90 11:42:56 PST From: tve (Thorsten von Eicken) Subject: more spring cleanup 1) How 'bout making mx/tx ICCCM conformant? That's the standard which is supposed to allow all X clients & window managers to talk/snarf/paste among each other. 2) How 'bout getting serious about access modes and groups? 3) While you're at hacking tx/mx, how 'bout allowing people to have the scrollbar on the *left* of the window? 872. Date: Mon, 29 Jan 90 15:01:53 PST From: Fred Douglis <douglis> Subject: select broken?
i am running tx on a ds3100 and selected some text. "^v" worked fine, but "select" produced no output. i knew there was a problem using tx/select between hosts of different byte orders but thought that on a single host it should work fine. no such luck. 873. Date: Tue, 30 Jan 90 11:51:16 PST From: douglis (Fred Douglis) Subject: sun4 cc bug not only did i get /tmp/cc931124.s:1574:End-of-File not at end of a line but i got it in an infinite loop. is this a known bug? 874. Date: Tue, 30 Jan 90 13:30:21 PST From: shirriff (Ken Shirriff) Subject: pmake bug, kernel warnings Pmake on the ds3100 gave me several "***Error code 1" messages during kernel module compiles for no apparent reason. I recompiled the modules without problem. The following warning messages arose during the compiles: sun4c.md/uword.c: In function read_iureg: sun4c.md/uword.c:80: warning: assignment between incompatible pointer types sun4c.md/uword.c:84: warning: assignment between incompatible pointer types sun4c.md/vmSun.c:2473: warning: `VmMachFlushWholeCache' was previously implicitly declared to return `int' sun4c.md/devVidSun4.s:23: warning: SUCCESS redefined sysTestCall.c: In function Test_PrintOut: sysTestCall.c:32: warning: structure defined inside parms sun4.md/devJaguarHBA.c:1406: warning: assignment of pointer from integer lacks a cast 875. Date: Tue, 30 Jan 90 15:59:06 PST From: tve (Thorsten von Eicken) Subject: Re: pmake bug, kernel warnings Looks a lot like the "Error code 6" problems I've had a few weeks ago. The only fix I came across was to use pmake -k (everything was non-deterministic and in all cases the programs ran just fine), (I wasn't doing compilations). 876. Date: Tue, 30 Jan 90 16:00:19 PST From: tve (Thorsten von Eicken) Subject: Re: sun4 cc bug Yes, known bug: I already sent in 2 "bug reports". I get the same symptoms. Compile on a sun3 meanwhile (if there still is one on sprite)... 877. 
Date: Tue, 30 Jan 90 17:22:10 PST From: shirriff (Ken Shirriff) Subject: sun4c compiler problem Compiling kernel/libc/fmt.c for the sun4 on the sun4c gives a bunch of parse errors. It works if I compile on a sun3. 878. Date: Tue, 30 Jan 90 22:08:07 PST From: Fred Douglis <douglis> Subject: out of processes mary's sun4c X server looped and caused her machine to fill up with xgone processes due to a script that runs periodically. when the machine gets into the mode of "no more processes" the feces hit the fan. what would people think of a soft limit on the number of processes that can be created by users other than root, to allow root the ability to still create processes even if a user creates far too many? this would be analogous to the BSD 10% hidden disk space that only root can write to. 879. Date: Wed, 31 Jan 90 09:11:05 PST From: brent (Brent Welch) Subject: stream recovery bug found I found a bug in the recovery code for streams. The server doesn't check that a client's stream hooks up to the same I/O handle as the server's. However, it does make this consistency check during I/O operations. Paprika and Kvetching were in recovery loops with Mint because of this. They'd get a FS_STALE_HANDLE from Mint on a Fs_PageRead, and then go through recovery ok. The bug meant that the erroneous client stream->ioHandle setup wasn't caught until the next I/O attempt. The reason that the server's stream doesn't hook to the same I/O handle as the client's is that there was a long network partition (hours long) and Mint reused a streamID. Ideally this wouldn't happen, but it does, and the state recovery code should guard against it. This bug has been around forever. brent ps. Anise did not go through recovery after the partition. However, I was able to rlogin into it from Allspice, and this triggered recovery with all the servers. So, there is still a bug left. 880. Date: Wed, 31 Jan 90 09:57:03 PST From: pmchen (Peter M. 
Chen) Subject: mysterious crash I've been getting consistent crashes from a simulation program. The crashes are pretty intermittent, but when running thousands of them (one after another), they crash the machine (ds3100) once every thousand or so (this takes many hours, sometimes 10-20 hours). The crashes seem to wipe out the screen, when it's running X, that is. Once I ran it on a raw console to see what the error messages were and it seemed to have floating point implications. The crashes have occurred on the "new" kernel (before the latest install), and also on the "ken" kernel (WITH the floating point fix). I'm now running it on apathy (without a window system, so we can see the error messages) using the new (after the install) kernel. 881. Date: Wed, 31 Jan 90 10:41:24 PST From: Fred Douglis <douglis> Subject: strange interaction between tx, emacs, and tcsh if i start tx from my x startup script, it creates windows with tcsh shells in them, and everything works. if i start tx from tx, it works okay too. if i start tx from an emacs subshell, my tx starts up in a funny mode where it doesn't echo characters. stty reset fixes that but also leaves tcsh out of the loop since it is no longer in raw mode. starting "tx < /dev/null" didn't help. ideas? 882. Date: Wed, 31 Jan 90 14:23:35 PST From: pmchen (Peter M. Chen) Subject: mysterious crash I crashed apathy, which was running the new "new" kernel. The screen blanked again, and it WASN'T running X. It crashed when running a csh script which invoked a simulator many times in succession (different parameters). Because the screen blanked, I didn't see the error log. The script is in ~pmchen/striping/simul/ex/xxdisk. The simulator is in ~pmchen/striping/simul. 883. Date: Wed, 31 Jan 90 16:01:07 PST From: gibson@rosemary.Berkeley.EDU (Garth Gibson) Subject: Pete's mysterious crash on apathy Pete asked me to describe what I saw during the crash. 
Out of the corner of my eye, while working on paper, I noticed the screen going crazy. The pixels were changing rapidly; there appeared to be a diagonal pattern, but it was more likely a rapidly stirred soup. This lasted a fraction of a second and then the screen went blank. As Peter was "in charge" of this experiment, I left everything alone (and worked on basil). 884. Date: Wed, 31 Jan 90 17:09:51 PST From: mgbaker (Mary Gray Baker) Subject: rsh rosemary An rsh to rosemary from treason gives the response ^ATry again. 885. Date: Wed, 31 Jan 90 21:14:26 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: bootp man page Bootp doesn't have a man page. 886. Date: Thu, 1 Feb 90 11:59:02 PST From: pmchen (Peter M. Chen) Subject: MAIL file corrupted I lost the tail end of /sprite/spool/mail/pmchen. It got overwritten by the next mail. This is bad. 887. Date: Thu, 1 Feb 90 14:20:38 PST From: tve (Thorsten von Eicken) Subject: Do processes *have* to go into the DEBUG state? In UNIX/csh one can say "limit coredumpsize 0" and processes which fault just die without producing a core dump. It would be nice if sprite had a similar mechanism! Would it be easy to kill the process (instead of having it hang in DEBUG) if its environment contains a variable of name "NODEBUG"? Or any equivalent mechanism... 888. Date: Thu, 1 Feb 90 14:33:23 PST From: tve (Thorsten von Eicken) Subject: no more processes I had X up, and was doing a pmake, all on crackle (sun4c with 1.055). Then I typed an innocuous command in a tx window, and got a "no more processes" error message (in tx, NOT in the syslog). Then X died away (nicely, i.e. closing down all connections) and the syslog on the console reported a MachBusError (or whatever..). Finally I got logged off (normal thing after X dies). I logged back on and I saw that a cc1.sparc was in DEBUG. I got the "no more processes" message once before: yesterday evening (running the new 1.055 kernel too), but the consequences were less dramatic.
I tried ps, but same message. I killed a few tx's and ps -a didn't show any unusual processes, and not "too many" in any case (I might have been too slow, and should have typed F1-p anyway). 889. Date: Thu, 01 Feb 90 14:36:41 PST From: tve (Thorsten von Eicken) Subject: Re: Do processes *have* to go into the DEBUG state? Please don't get rid of the concept: nothing more painful than a 20Meg process which core dumps. *You* might not have such processes (compilers don't get that big, do they?) but people doing VLSI crap get much larger processes! 890. Date: Thu, 1 Feb 90 14:57:24 PST From: tve (Thorsten von Eicken) Subject: ouch on sun4 Fs_PageCopy: Copy failed <40008> Couldn't fork! errno = 22 exiting. This appeared on my syslog and my xmh went away. Any clues? 891. Date: Thu, 1 Feb 90 16:28:33 PST From: tve (Thorsten von Eicken) Subject: details on sun4 cc going into debugger OK, I'm running on a sun4c, kernel 1.055, I'm using /sprite/src/cmds/cc/sun4.md/cc because the old version always dies on the infamous "End-of-File not at end of a line" bug, and I reported this bug before but never really got any answer. The very first reference I could find dates back to Dec 13 by Mary (~sprite/Log/log/01017). Is the bug too difficult to fix? Or can't you reproduce it? Ok, here are some pieces of info on *one* instance of the problem, I can give you as many instances as you like... SYSLOG: MachPageFault: Bus error in user proc 63754, PC = 39e50, addr = 4f2050 BR Reg 80 PS: [crackle sun4.md] ps -w 160 -d PID STATE TIME COMMAND 63754 DEBUG 0:06 /sprite/lib/gcc/sun4.md/cc1.sparc /tmp/cc538448.cpp -quiet -dumpbase wireratio.c -msun4 -fwritable-strings -g -O -o /tmp/cc538448.s [crackle sun4.md] ps -dM PID STATE FLAGS EVENT RNODE RPID COMMAND 63754 DEBUG 4002 ffffffff /sprite/lib/gcc/sun4.md/cc1.spa... [crackle sun4.md] sysstat crackle SPRITE VERSION 1.055 (sun4c) (31 Jan 90 17:15:11) [I.e.
it died on MY machine which runs 1.055] PMAKE: --- sun4.md/wireratio.o --- rm -f sun4.md/wireratio.o cc "-DCADROOT=\"/users/octtools\"" -fwritable-strings "-DCUR_DATE=\"`date | awk '{print %2, %3, %6}'`\"" "-DCUR_TIME=\"`date | awk '{print %4}'`\"" -g -O -msun4 -Dsprite -Dsun4 -I. -Isun4.md -I/users/octtools/lib/include -c wireratio.c -o sun4.md/wireratio.o GDB: [crackle sun4.md] gdb cc1.sparc Reading symbol data from /sprite/lib/gcc/sun4.md/cc1.sparc...done. (gdb) attach 0x63754 Attaching program: /sprite/lib/gcc/sun4.md/cc1.sparc pid 407380 Reading in symbols for final.c...done. 0x39e50 in output_address (x=(rtx) 0x10f8c8) (final.c line 1528) final.c: no such file or directory. (gdb) where #0 0x39e50 in output_address (x=(rtx) 0x10f8c8) (final.c line 1528) #1 0x39aac in output_operand (x=(rtx) 0x10f8d8, code=0) (final.c line 1516) #2 0x39884 in output_asm_insn (template=(char *) 0x0, operands=(rtx *) 0x1dffdf18) (final.c line 1463) #3 0x454ac in output_fp_move_double (...) (...) #4 0x48d94 in output_54 (...) (...) #5 0x38cb0 in final_scan_insn (insn=(rtx) 0x10c5b0, file=(struct _file *) 0xc8708, write_symbols=311236, optimize=1, prescan=0, nopeepholes=54) (final.c line 1004) #6 0x37e40 in final (first=(rtx) 0xcf260, file=(struct _file *) 0xcc070, write_symbols=DBX_DEBUG, optimize=1, prescan=0) (final.c line 531) #7 0x94b2c in rest_of_compilation (...) (...) #8 0x9394 in finish_function (...) (...) #9 0xd168 in yyparse (...) (...) #10 0x937a0 in compile_file (...) (...) #11 0x95710 in main (...) (...) (gdb) 892. Date: Thu, 1 Feb 90 16:29:13 PST From: brent (Brent Welch) Subject: Allspice crash Allspice crashed mysteriously. After getting RPC timeout messages I went in to check it out. There were a few messages on the console about hosts going through recovery. I hit return on the console and immediately it went into the debugger with an "unaligned address in kernel". Mendel ran the debugger and found that a call to Fsutil_WaitListInsert had a bogus pointer. 
The only other clue was that allspice had previously filled up its disk, but then I reclaimed the space by rebooting fenugreek so that it deleted its swap files. Also, Allspice had just created all the RPC server processes it was allowed to. So something funny was going on, but we don't know what. 893. Date: Thu, 1 Feb 90 16:58:11 PST From: mgbaker (Mary Gray Baker) Subject: ds3100 TLB fault or such? This is a rather vague bug report. After allspice recovered, Bob Miller's machine didn't quite come back. The mail process in one window was hung. Pinging his machine from allspice with "rpccmd -ping" didn't do anything. There were a lot of Skipping Stream messages in the syslog, and also a TLB fault message. I didn't know what to tell him about why this happened. Maybe somebody can tell me? 894. Date: Thu, 1 Feb 90 17:40:23 PST From: pmchen (Peter M. Chen) Subject: cc bug on sun4c I've heard about this bug for a while from Thorsten's mail. It's the "End-of-File not at end of a line" bug when pmaking on a sun4c for a sun4. To duplicate, pmake in ~pmchen/striping/simul (pmake clean first). When I typed in the offending cc line cc -g -O -msun4 -Dsprite -Dsun4 -I/users/pmchen/lib/include -I. -Isun4.md -c simul.c -o sun4.md/simul.o I got MachPageFault: Bus error in user proc 21d34, PC = 39e50, addr = 49c1b0 BR Reg 80 consistently. Compiling on a sun3 for a sun4 works fine. 895. Date: Thu, 1 Feb 90 17:58:57 PST From: tve (Thorsten von Eicken) Subject: Re: details on sun4 cc going into debugger On a ds3100, using /sprite/src/cmds/ds3100.md/gcc, cc1.sparc goes into the debugger on the same file.
And I just checked: on a sun3 it also goes into the debugger, as a bonus, it spits out lines like this: T(875) 0 0x00000000 0x00000000 0x00000000 0x00000000 T(1054) 0 0x00000000 0x00000000 0x00000000 0x00000000 T(2287) 0 0x00000000 0x00000000 0x00000000 0x00000000 T(2308) 1 0x00000000 0x00000000 0x00000000 0x00000000 T(1030) 2 0x00000000 0x00000000 0x0008bc72 0x0008bc9e In other words, the only way I can compile, is to use the old (i.e. installed) compiler on a sun3!!! 896. Date: Thu, 1 Feb 90 18:37:36 PST From: brent (Brent Welch) Subject: Oregano crash, cache deadlock The nfsmount process that hung up on Oregano was stuck trying to remove a swap file. This is getting to be a familiar cause of death for the servers. Here's a stack trace: #0 0xe004136 in Mach_ContextSwitch () #1 0xfeedbabe in ?? () #2 0xe060c42 in SyncEventWaitInt (...) (...) #3 0xe060358 in Sync_SlowWait (...) (...) #4 0xe021e28 in GetUnlockedBlock (...) (...) #5 0xe02092c in CacheFileInvalidate (...) (...) #6 0xe02050c in Fscache_UnlockBlock (...) (...) #7 0xe01f29c in FreeIndirectBlock (...) (...) #8 0xe01efd6 in Fsdm_EndIndex (...) (...) #9 0xe01b558 in Fsdm_FileDescTrunc (...) (...) #10 0xe028500 in Fsio_FileTrunc (...) (...) #11 0xe02d556 in Fslcl_DeleteFileDesc (...) (...) #12 0xe027568 in Fsio_FileCloseInt (...) (...) #13 0xe02d3da in DeleteFileName (...) (...) #14 0xe02c392 in FslclLookup (...) (...) #15 0xe02ba14 in FslclRemove (...) (...) #16 0xe032896 in Fsprefix_LookupOperation (...) (...) #17 0xe0174cc in Fs_Remove (...) (...) #18 0xe072dd8 in VmSwapFileRemove (...) (...) #19 0xe06ae7e in DeleteSeg (...) (...) #20 0xe06add2 in Vm_SegmentDelete (...) (...) #21 0xe04cbc6 in ProcExitProcess (...) (...) #22 0xe04c702 in Proc_ExitInt (...) (...) ---Type <return> to continue--- #23 0xe05fdf6 in Sig_Handle (...) (...) #24 0xe004ff0 in MachUserReturn (...) (...) #25 0xe004f6a in MachTrap (...) (...) #26 0xe0063c0 in MachBusError () (gdb) fram 4 Reading in symbols for fsBlockCache.c...done. 
#4 0xe021e28 in GetUnlockedBlock (blockHashKeyPtr=(BlockHashKey *) 0xe9b7814, blockNum=-2) (fsBlockCache.c line 3042) Source file is more recent than executable. 3042 (void) Sync_Wait(&blockPtr->ioDone, FALSE); (gdb) p *blockPtr ERROR: invalid read address 0x0 (gdb) p blockPtr ERROR: invalid read address 0x0 As you can see, I couldn't look around very well to see exactly what was wrong. My recollection from past experiences is that somehow a cache block gets a zillion references to it, and GetUnlockedBlock waits (forever) for these references to go away. Glancing at the end of the stack trace again: #3 0xe060358 in Sync_SlowWait (...) (...) #4 0xe021e28 in GetUnlockedBlock (...) (...) #5 0xe02092c in CacheFileInvalidate (...) (...) #6 0xe02050c in Fscache_UnlockBlock (...) (...) #7 0xe01f29c in FreeIndirectBlock (...) (...) #8 0xe01efd6 in Fsdm_EndIndex (...) (...) This highlights some of the over-generality of the cache implementation. The CacheFileInvalidate is a very general routine that rehashes to get the block, works on a range of blocks, etc. It seems clear that this block just needs to be marked for deletion inside Fscache_UnlockBlock and eventually nuked when it has no more users. Any takers? 897. Date: Thu, 1 Feb 90 21:22:31 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: ds3100 TLB LD miss Hijack went into the debugger on a TLB load miss in the kernel. It was right in the middle of restoring registers after a call, and one of the middle loads suffered the fault. This was not on a page boundary. Furthermore, there is a valid tlb entry for the page. A while ago I changed the kernel exception handler to store away each virtual address that caused an exception so I could debug stuff like this. Unfortunately this variable contained a different address than the one indicated by the pc. This different address was in the kernel heap and was also valid.
This whole thing has me stumped because I can't figure out any way for the kernel to enter the debugger on this particular exception without entering the kernel exception handler first. 898. Date: Thu, 1 Feb 90 22:00:29 PST From: shirriff (Ken Shirriff) Subject: Mail got clobbered My mail file just got clobbered. Here's the sequence of events: I had new mail: messages 180, 181, and 182. I read 180 and deleted it. I sent a reply to 181. I deleted 182. Then I left mail. In my syslog I got a message from sendmail that the reply I'd tried to send had failed. I got a message "New mail has arrived", so I went back into mail. Mail had restored as message 180 the message 180 that I'd deleted. However, the mail from the mailer-daemon about my failed reply had clobbered the last half of message 180. Message 181 disappeared totally. I have a copy of my tx transcript and a copy of my mail file if anyone has ideas on how to track this down. 899. Date: Thu, 01 Feb 90 22:34:49 PST From: Fred Douglis <douglis> Subject: Re: Mail got clobbered It sounds an awful lot like (1) flock() is broken, (2) some program isn't behaving itself, or (3) there's some sort of cache consistency problem. If your example provides a repeatable test case it should be easy enough to track and would probably be due to 1 or 2. If not, it'll be much harder. I'll check it out tomorrow. 900. Date: Fri, 02 Feb 90 10:32:08 PST From: Fred Douglis <douglis> Subject: memory trashing problem on ds3100 twice in the past two days my machine has crashed with "Mem_Free: storage block already free". The first time, the kernel never fully entered the debugger, but today it did. The string being freed was undamaged, which implies that only the admin block got corrupted. I only started seeing this bug fairly recently (1.053 and 1.055 kernels). has anyone else hit it? anyone have suggestions for what might have changed? 901. 
Date: Fri, 02 Feb 90 10:59:50 PST From: Fred Douglis <douglis> Subject: trashed file client list when out of segments i tried to start up one tx/rlogin per host. my system wedged, with a process in the RUNNING state and lots of other processes READY. the debugger showed that this process was in a LIST_FORALL in Fsio_StreamClientClose with a garbaged client list. It got there because it ran out of segments and tried deleting a segment it had just allocated. no idea why the list was bad, but it could be related to the memory trashing bug i just reported. 902. Date: Sat, 3 Feb 90 14:30:46 PST From: shirriff (Ken Shirriff) Subject: mail problems / dec crashes (investigation) I looked at the mail file truncation problem yesterday with inconclusive results. I had a loop: while (1) echo test | mail shirriff to send me a bunch of mail messages. Meanwhile I would enter and exit mail. (This was on the ds3100 running the new kernel.) This would fairly regularly mangle one of the messages. It also fairly regularly produced the crashes John and Fred have seen: TLB load miss and page already free (sorry, I can't remember exactly what it was, but it was the same one Fred saw). Then I tried this on an old kernel to see if the problem happened, but it wouldn't recur. Instead mint would get unhappy and do RPC timeouts for a minute. I went back to the new kernel and I still couldn't get the truncation/crashes to recur. I am unsure what conclusions to draw from this. The earlier crashes were around 2PM, while the later non-crashes were around 5PM, so perhaps other activity on the system triggers the problem? 903. Date: Sat, 3 Feb 90 18:05:56 PST From: pmchen (Peter M. Chen) Subject: decstation crashes I have a fairly repeatable crash mode on apathy (and other ds3100's). I can consistently get a crash within 3 hours by running the same simulation (same input parameters) for a couple hundred times. The kernel running is "new", though I've crashed non-"new" kernels too.
The program uses floating point. This is what we've been referring to as the "mysterious" crash. Garth specifically reported seeing the screen go crazy (it was not running X), then blank. Since the screen goes blank, I can't read the syslog by normal (on the screen) means. I've tried cat-ing /dev/syslog to a file (/tmp/apathy.syslog) and reading from another machine to make it non-cacheable, but the syslog hasn't printed out anything unusual. After a crash, the machine frequently prompts for language type upon reboot. This makes me suspect that the battery backed up memory might be getting trashed. The program to run is a script in ~pmchen/striping/simul/ex/debug. The script can be run by: cd ~pmchen/striping/simul mv out/debug out/debugsave (or some other non-existent directory) bin/go debug 904. Date: Sun, 4 Feb 90 11:22:22 PST From: fwo (Fred W. Obermeier) Subject: Makefile incompatibility Hi, I haven't switched over to pmake yet, but I noticed that SPRITE's Makefile does not support "include filename" statements as BSD UNIX and SUNOS do. 905. Date: Sun, 4 Feb 90 13:47:43 PST From: mgbaker (Mary Gray Baker) Subject: molasses on oregano If I do a "df" on treason, I get RpcDoCall: <domain info> RPC to oregano is hung After a few minutes I gave up and typed some ^C's. Then after about another minute the prompt finally came back. Opens and other rpc's are also timing out. A second df also hung and no amount of ^C's will kill it. An rpcstat -srvr on oregano shows that a server process thinks it's busy handling a domain info rpc for treason. But it doesn't seem to be getting anywhere. 906. Date: Mon, 5 Feb 90 17:34:20 PST From: brent (Brent Welch) Subject: /sprite2 The problem with /sprite2 is the core leak in nfsmount. The nfsmount process for /sprite2 now has a 20Meg virtual image size. Every access to /sprite2 causes lots of paging, and it's slow. Supposedly there is a core leak in the SUN RPC library that nfsmount uses. There may be a leak in nfsmount itself.
I don't have the time right now to fix this. 907. Date: Mon, 05 Feb 90 18:28:52 PST From: Fred Douglis <douglis> Subject: IOC_SET_OWNER not byteswapped If a process opens a pdev on another host, and they are of different byte orders, then IOC_SET_OWNER will generate a bogus processID because the pid won't be swapped. 908. Date: Mon, 05 Feb 90 19:48:50 PST From: Fred Douglis <douglis> Subject: Fs_Dispatch exits instead of propagating errors as you may see from the following code fragment, i have a problem:

    numReady = select(maxPossNumStreams, tempReadMask, tempWriteMask,
            tempExceptMask, (struct timeval *) timeoutPtr);
    if (numReady == 0) {
        /*
         * Nothing happened on the streams but a routine in the timeout
         * queue needs to be called now.
         */
        CallTimeoutHandler();
        fsNumTimeoutEvents++;
    } else if (numReady < 0) {
        if (errno != EINTR) {
            fprintf(stderr, "Fs_Dispatch select error: %s\n", strerror(errno));
            exit(1);
        }

so, if someone gets an I/O error because they select on a pdev whose master goes away, then Fs_Dispatch exits rather than allowing the caller to take some other action. (in my case, i would try to create a new master.) what do you think is the right solution here? will anything break if Fs_Dispatch ignores EIO as well? should it ignore more than that? maybe allow the user to register a callback in case of errors, or something? i'm happy to implement any of the above but am wondering about the long-range implications since i don't know all the programs that use Fs_Dispatch. 909. Date: Tue, 06 Feb 90 10:41:20 PST From: Fred Douglis <douglis> Subject: wedged prefix there for keeps I noticed that sprite2 was hanging and in fact had never been restarted, and it had a swap image of 20MB again, so i tried to kill it. silly me. it wedged again, so instead of hanging RPCs eventually being ok, now they just hang forever. if i go to another machine and try "prefix -d /sprite2; nfsmount..."
it lets me clear the prefix but then imports it from oregano again rather than letting the new host export it. does the code to export a prefix first broadcast to see if someone else is exporting it already? 910. Date: Tue, 6 Feb 90 10:53:36 PST From: mendel (Mendel Rosenblum) Subject: sparcStation watchdog reset That nasty dog is back on jaywalk. It was running the new (1.055) kernel and I was trying to print a paper on lw477. It started with a wild video display followed by the watchdog reset. The watchdog was caused by a panic() that panic'ed because the MASTER_LOCK found a negative interrupt count. (Bug #1 - Panic probably shouldn't call itself infinitely.) I couldn't find the start of the stack to figure out what caused the initial panic(). It appeared to be the stack of an RPC server. proc_RunningProcesses[0] was NIL so it could have been running at interrupt level when the problem happened. (Bug #2 - From experience with RAID, it seems like a panic() at interrupt level doesn't make it into the debugger on the sun4's.) 911. Date: Tue, 6 Feb 90 15:08:13 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: separate config files for ipServer There are separate copies of the ipServer configuration file in each of the /sprite/daemons.* directories. This file should be moved elsewhere, or symbolic links should be made. Better yet, move the daemons to /sprite/daemons/cmds, and leave the configuration files in /sprite/daemons. 912. Date: Tue, 6 Feb 90 21:28:00 PST From: ouster (John Ousterhout) Subject: Re: separate config files for ipServer I think the ipServer should use the standard library facility: there should be a directory /sprite/lib/ipServer that contains the config file, or perhaps /sprite/daemons/lib/ipServer. Ditto for inetd, which has its own configuration file (inetd.conf). 913. Date: Wed, 7 Feb 90 00:12:34 PST From: shirriff (Ken Shirriff) Subject: ds3100 OMAGIC files. The good news is OMAGIC files will work on the ds3100 now.
The bad news is that ~culler/cl/bin/cl (the lisp interpreter I implemented OMAGIC files for) prints a bunch of "MachUNIXGetDirEntries: Bad directory format" in the syslog and then ends with: "The environment this lisp is running in has used up too much stack. It cannot be restarted" 914. Date: Wed, 7 Feb 90 09:51:25 PST From: brent (Brent Welch) Subject: Re: status of /sprite2 ? The nfsmount on /sprite2 ends up with a huge virtual memory image because of a core leak. Even worse, killing this process causes it to deadlock within the FS cache. So, there are two bugs that need to be fixed. The reason this has just recently caused a problem is that the "bounds" file for msgs was changed and the result is a huge number of lookups for non-existent message files. I think that each RPC leaks memory in nfsmount, so it has been ending up with a 20Meg memory image, and it pages on almost every RPC. 915. Date: Thu, 8 Feb 90 14:43:15 PST From: brent (Brent Welch) Subject: file corruption still happens I deliberately teamed up all the machine types against the file servers today, and I observed one instance of file corruption, ack! It was a file in the net module, and Allspice is running a kernel with my last fix to the UpgradeFragment code. The beginning part of a file was mangled. Oh, how exciting, the file's contents have changed since I looked at it first. The initial fragment of this file is being erroneously shared with other files. A race still lurks in the fragmenting code. 916. Date: Thu, 8 Feb 90 18:12:21 PST From: mgbaker (Mary Gray Baker) Subject: Jaywalk turned to molasses Jaywalk turned to molasses tonight. It was getting page faults where every page fault caused a pmeg to be stolen. 917. Date: Fri, 9 Feb 90 15:48:19 PST From: root (The Sprite God) Subject: system wedging problems sprite has been behaving badly this afternoon. first allspice's load hit over 20 and allspice didn't respond to anything, including keyboard input.
i finally hit the watchdog reset button and rebooted. about 15 minutes after it came back, its load shot up again. this time kvetching wedged up, and when i came upstairs i couldn't log in as root on mint or allspice. i noticed mint was complaining about consist timeouts with kvetching so i ran l1d from ginger and that cleared things up once kvetching was in the debugger. (on kvetching, by the way, processes were in the DEAD state -- surely a bad sign). i'll debug kvetching if i can. once mint came back, allspice went into an infinite recovery loop with terrorism, with allspice complaining that something didn't have a handle. i threw terrorism into the debugger as well. 918. Date: Fri, 09 Feb 90 17:19:17 PST From: Fred Douglis <douglis> Subject: /swap1: directory from hell it seems that anything trying to open /swap1 or anything below it hangs, at least in some circumstances. that would also explain why processes can't exit on some machines. this hoses migration and some other things. with all the wedging we've had all afternoon, i hate to debug allspice, but at the same time we should probably do it while we can. 919. Date: Fri, 9 Feb 90 17:51:01 PST From: douglis@ginger.Berkeley.EDU (Fred Douglis) Subject: allspice deadlock allspice deadlocked on /swap1. it ran out of rpc_servers because pride kept broadcasting for /swap1 once it had been rebooted, and each one chewed up a server. (we must be able to do something about that....) one process that looked suspicious was a server for piracy's remove of a swap file. the backtrace looked a lot like oregano's backtrace of the wedged nfsmount: #3 0xf60385d8 in GetUnlockedBlock (...) (...) #4 0xf6036858 in CacheFileInvalidate (...) (...) #5 0xf6036230 in Fscache_UnlockBlock (...) (...) #6 0xf6034964 in FreeIndirectBlock (...) (...) #7 0xf6034550 in Fsdm_EndIndex (...) (...) #8 0xf602f520 in Fsdm_FileDescTrunc (...) (...) #9 0xf6040e70 in Fsio_FileTrunc (...) (...) #10 0xf6047e7c in Fslcl_DeleteFileDesc (...) (...)
#11 0xf603f9e8 in Fsio_FileCloseInt (...) (...)
#12 0xf6047c98 in DeleteFileName (...) (...)
#13 0xf6046624 in FslclLookup (...) (...)
#14 0xf604588c in FslclRemove (...) (...)
#15 0xf60551d0 in Fsrmt_RpcRemove (...) (...)
the full trace, with args, is in rosemary:/tmp/sprite/allspice.log. it indicates one other blocked process of note, by the way:
#3 0xf60580d4 in Fsutil_HandleFetch (fileIDPtr=(struct Fs_FileID *) 0xf8077a48) (fsHandle.c line 542)
#4 0xf6057a0c in Fsutil_HandleInstall (...) (fsHandle.c line 282)
#5 0xf603eeac in Fsio_LocalFileHandleInit (...) (fsFile.c line 80)
#6 0xf6046a68 in FindComponent (...) (fsLocalLookup.c line 772)
#7 0xf6046004 in FslclLookup (...) (fsLocalLookup.c line 330)
#8 0xf604545c in FslclGetAttrPath (...) (fsLocalDomain.c line 233)
#9 0xf6051484 in Fsrmt_RpcGetAttrPath (...) (...)
in each case they were blocked at these points. other processes were blocked trying to access /swap1.

920. Date: Sat, 10 Feb 90 14:22:15 PST From: mendel (Mendel Rosenblum) Subject: lint on the sun4 I installed the lint program for the sun4 and it appears to work except that it can't read lint libraries created by the other machines. The problem is that lint libraries are binary files with structures containing shorts and ints. This means that the sun4 lint can't read libraries generated on the ds3100 or sun3. Unfortunately, most of the lint libraries for the sun4 have been generated on the sun3. This problem is compounded by the flaky compilers on the sun4 that make the sun3 the only stable base for compiling for the sun4.

921. Date: Sun, 11 Feb 90 18:22:57 PST From: Fred Douglis <douglis> Subject: allspice wedged again with /swap1 locked. jhh and i poked around in the debugger but didn't find anything new. we decided to reboot with "sun4" instead of "brent" in the hope that the bug is new. let's see if allspice does any better this time around. by the way, most of what it was wedged on seemed to be related to kvetching. allspice was running BW.234.
kvetching is running 1.056. i am not aware of anything in particular i might be doing on kvetching to cause this problem. at the time allspice started wedging, kvetching froze up and i couldn't do too much, but i could see that it was trying to do lots and lots of pageouts. 922. Date: Mon, 12 Feb 90 09:04:06 PST From: brent (Brent Welch) Subject: Re: allspice wedged again I think that all the Allspice problems are related to the cache deadlock. This bug is the primary cause of server death. 923. Date: Mon, 12 Feb 90 17:31:27 PST From: shirriff (Ken Shirriff) Subject: Mail problems These both seem to be problems with the new version of mail. 1. I can't use ^C to get out of mail. After the first one it says (Interrupt -- one more to kill letter) but then any more get ignored. 2. Sometimes mail won't let me use ~ escapes. I'll be typing, I'll use ~v to get to vi, I'll leave vi, and then no more ~ escapes will work; they just end up in my file. I can't repeat this at will, but it's happened twice. 924. Date: Mon, 12 Feb 90 17:51:01 PST From: pmchen (Peter M. Chen) Subject: mail The mail program on decstations has been flaky recently. For example, when coming out of vi mode, often the vi process hangs and mail is unable to go on. This happens most frequently when I exit vi (ZZ) and typeahead a ~p. I am also consistently unable to control-C out of mail (try it). The mail process hangs in the RWAIT state. You can kill it manually, but not by control-C or control-Z. 925. Date: Tue, 13 Feb 90 14:10:26 PST From: brent (Brent Welch) Subject: cache deadlock fixed I found the problem with the cache code that has been plaguing Allspice lately. I'm off to lunch, and if it dies in the meantime you should reboot with my sun4 kernel (BW.239). (This image will be copied to rosemary:/tmp/brent/sun4.brent) The bug had to do with my non-blocking cache block fetches. I added this last fall to fix a different problem, and in one case I wasn't backing out right. 
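A minimal sketch of what "backing out" of a failed non-blocking fetch involves. This is not the actual fsdm or fscache code: the Block type and the TryFetch, Release, and FetchPair names are all hypothetical, chosen only to illustrate the pattern. When a second non-blocking fetch fails, the reference already taken on the first block has to be dropped before returning.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cache block; not the Sprite cache structure. */
typedef struct Block {
    int  refCount;   /* outstanding references to this block */
    bool busy;       /* true if a non-blocking fetch must fail right now */
} Block;

/* Non-blocking fetch: take a reference, or fail at once if the block is busy. */
static bool TryFetch(Block *b)
{
    if (b->busy) {
        return false;
    }
    b->refCount++;
    return true;
}

/* Drop a reference taken by TryFetch. */
static void Release(Block *b)
{
    b->refCount--;
}

/*
 * Fetch two blocks that are needed together. On failure no references
 * are left behind, so the caller can simply retry later.
 */
static bool FetchPair(Block *first, Block *second)
{
    if (!TryFetch(first)) {
        return false;
    }
    if (!TryFetch(second)) {
        Release(first);   /* the back-out step; without it first stays referenced */
        return false;
    }
    return true;
}
```

If the Release(first) call is left out of the failure path, the first block keeps an extra reference forever, which is the general shape of a leak that can later turn into a deadlock.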
If a file was into doubly indirect blocks it might successfully grab one block but then not be able to get the second. It didn't back out from this case right and left an extra reference to the (all important) first level block. The fix is in fsdm, and I've also tidied up fscache (removed some old warnings). I'll install these and set up a new kernel soon. In the meantime my sun3 and sun4 kernels have the fix in them. 926. Date: Tue, 13 Feb 90 14:54:32 PST From: ouster (John Ousterhout) Subject: Eviction problem on ds3100's? I've been using migration to do some compiles at WRL today, and I've noticed that every time I cause a process to be evicted, pmake gets "*** Error code 1" and quits. This has been 100% reproducible here (i.e. 3/3 times). Perhaps there is something about the WRL environment that is causing this problem, but I vaguely remember people reporting similar symptoms at Berkeley.... could this be a consistent problem that hasn't been repeatable because eviction is infrequent? 927. Date: Tue, 13 Feb 90 22:05:50 PST From: pmchen (Peter M. Chen) Subject: mail problems Mail continues to have problems on the decstations. Once, I wasn't able to control-Z out of mail (it hung and had to be killed manually). Another time, a subprocess executing "folders" hung (the ls /users/pmchen/mail process hung in the EXIT state). Can we get back to a reasonable version (the old one was fine) until a more mature version exists? (While mailing this bug report, mail died (the vi process hung in the EXIT state)). 928. Date: Tue, 13 Feb 90 22:49:44 PST From: Fred Douglis <douglis> Subject: Re: Eviction problem on ds3100's? i've noticed this before and had suspicions but was never able to produce the problem as reliably as you were. 
today i have been able to reproduce that problem about half the time, including at least a couple of times when i got ugen: internal: cannot open /tmp/ctmgta28249 what i'm now wondering is whether it's a problem with unix compatibility in the kernel. normally, system calls like read & write go through a level that handles signals and retry. in sprite, other system calls have to do a similar check due to migration -- in particular, Fs_IOControl and perhaps Fs_Select may be more robust than in unix w.r.t. signals. can open get aborted by a signal, for example? anyway, i haven't been able to reproduce the "exit code 1" problem with programs other than cc. if compatibility is the problem, i think we may just have to accept a lack of transparency for ultrix binaries or make such binaries nonmigratable once active, or something. i'm open to suggestions. i may do some more kernel hacking to try to get to the root of this thing, but bob said something about rebuilding ds3100 objects using gcc tonight so this isn't a good time. i'll add it to my to-do list. 929. Date: Wed, 14 Feb 90 08:31:17 PST From: ouster (John Ousterhout) Subject: Piracy is in the debugger "MachKernelExceptionHandler: Address error on load: addr: 8013121e PC: 80085988" Anyone care to take a look? If not I'll reboot it in a couple of hours. 930. Date: Wed, 14 Feb 90 16:17:53 PST From: Fred Douglis <douglis> Subject: local fsync broken on new (ds3100?) kernels mark sullivan thought his kernel was broken because he couldn't write files in emacs. it turns out that writing to a file on the same machine (i.e., something in /user2 on assault, or /postdev on babylon) generates an IO error in emacs. since the file is written successfully, it appears it must be fsync that's broken, just like for nfsmount partitions. mark says that this isn't the case on the default ds3100 kernel running on babylon, but it is the case on assault (1.056) and babylon running mark's new kernel. 
i looked over the fs code and it appears that the Fs_FileWriteBackStub got changed to call Fs_IOControlStub, except my impression is that this won't work because Fs_IOControlStub will try to copy in the args from user space instead of taking the args passed within the kernel. 931. Date: Wed, 14 Feb 90 16:40:02 PST From: Fred Douglis <douglis> Subject: pmake hanging problem it turns out the current version of pmake only selects on up to 32 descriptors at a time. it seems that it's possible to get past that point and then miss outstanding output. this is the straw that's breaking the camel's back... this problem is fixed in adam's newer version, which i'm going to port to sprite. 932. Date: Wed, 14 Feb 90 17:28:39 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: tx bug report It is possible for the TERMCAP of a tx window to be wrong. This will happen if your window manager does not display titles, and if your tx does not display titles (ie the tx window doesn't have a title), and if you specify the geometry of the tx window (eg tx =80x23+0+0). I think the TERMCAP has too few lines. Rpn will not work correctly. 933. Date: Wed, 14 Feb 90 20:45:06 PST From: Fred Douglis <douglis> Subject: sun4 loader bug i compiled the migration daemon on a sun4 for a sun3. the resulting binary ran okay on some sun3s but not others. on the others it printed "couldn't extend heap" and exited. (I noticed that jhh was having trouble with the sun3 X server saying the same thing recently after it had been recompiled; i'll bet that tve compiled that on a sun4 too.) anyway, gdb showed that brk.c's "nextAddr" variable was set to 0 rather than the address of 'end'. relinking on a sun3 was fine. 934. Date: Wed, 14 Feb 90 22:42:25 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: rcs -u vs co -u "rcs -u" leaves a file root writable so the next "co -l" prompts you if you really want to overwrite the file. This confuses both addhost and myself. 
"co -u" will leave the file unwritable. I'm not sure if this can be considered a bug of "rcs -u", but please use "co -u" instead. 935. Date: Thu, 15 Feb 90 09:19:00 PST From: brent (Brent Welch) Subject: local fsync() fixed Oops. I left out a break statment in a switch, so the fsync() worked but then fell through to a default error case. I've fixed this and installed the fsio module (for all machine types). 936. Date: Thu, 15 Feb 90 10:00:41 PST From: brent (Brent Welch) Subject: VM clean up after failed recovery A number of machines acted up after Allspice rebooted last night. I traced the problem to skipped clean up actions in Vm_PageIn. I think there is a trivial fix, and I've asked Mendel, Mary, and Mike about it. Symptoms include complaints about page ins failing with status <1>, etc. 937. Date: Thu, 15 Feb 90 10:18:36 PST From: root (The Sprite God) Subject: kmsg doesn't work on sun4's I can only use kmsg to a sun3 from a sun3. I'm not sure about other machine cases. >From a sun4, anyway, kmsg has no effect. 938. Date: Thu, 15 Feb 90 12:16:32 PST From: pmchen@ginger.Berkeley.EDU (Peter M. Chen) Subject: problems sending mail When sending mail from mustard (a decstation), I get: Reserved instruction in process 52c34 at pc=417570 in my syslog. 32c36 DEBUG 0:00 send-mail -i -m bugs 72c0b DEBUG 0:00 send-mail -i -m bugs jhh And the mail doesn't come out. ps. The two bug reports I was sending out were: 1) Subject: clear on sparcstations under UNIX When I use "clear" on a sparcstation (coriander), it coredumps under tx. I had just set the termcap manually (via the Control menu). 2) Subject: booting mustard (decstation) (mustard used to be called apathy) When booting "new" (1.057), I often get a TLB LD miss which puts mustard in the debugger. This has happened three times, but doesn't always happen. 939. Date: Thu, 15 Feb 90 12:25:58 PST From: pmchen (Peter M. 
Chen) Subject: unable to send mail on mustard I can send mail fine on garlic (ds3100) but not from mustard (also a ds3100). garlic is running 1.058; mustard is running 1.057 (but I don't think that's it).

940. Date: Thu, 15 Feb 90 12:11:25 PST From: pmchen (Peter M. Chen) Subject: booting mustard (decstation) (mustard used to be called apathy) When booting "new" (1.057), I often get a TLB LD miss which puts mustard in the debugger. This has happened three times, but doesn't always happen.

941. Date: Thu, 15 Feb 90 15:07:45 PST From: Fred Douglis <douglis> Subject: pdev master not always marked as gone i'm having trouble with my migration daemon because it doesn't always restart if the previous incarnation dies. the file system says the pdev is busy. however, there's no migd process running and nothing else that would have reason to be using that pdev, so it seems like a reference count is getting messed up. removing the pdev is fine, but might result in pdevs showing up in lost+found or in any case lingering about when they shouldn't (plus, the exclusive access for the master is used to ensure that only one master runs per host, so removing it must be done by hand rather than automatically by migd itself).

942. Date: Thu, 15 Feb 90 16:37:20 PST From: mendel (Mendel Rosenblum) Subject: migration and floating point don't mix Programs using the floating point unit on sparcStations and decStations don't migrate correctly. The following test program demonstrates the problem if you migrate it while it's running.

    #include <stdio.h>
    #include <stdlib.h>

    double gf;                  /* global copy, kept in memory */

    int main()
    {
        register double lf;     /* local copy; the register hint may keep it in an FP register */

        gf = lf = 1.0;
        while (1) {
            gf += 1.0;
            lf += 1.0;
            if ((gf != lf)) {   /* the two copies should never differ */
                printf("Error gf = %f, lf = %f\n", gf, lf);
                exit(1);
            }
        }
    }

It seems to be close to 100% repeatable. There is nothing special about this program. Just about any program that interacts with the floating point unit a lot will fail. This is a serious problem because gcc uses floating point when compiling programs that contain floating point.

943.
Date: Thu, 15 Feb 90 17:16:33 PST From: shirriff (Ken Shirriff) Subject: .mk files need to be cleaned up As a possible spring cleaning activity, the /sprite/lib/pmake/*.mk files need to be cleaned up, because there are a bunch of fixes that have been made to some, but not all of the files. For instance, boot.mk didn't have the changes for ds3100 .s files.

944. Date: Thu, 15 Feb 90 17:23:23 PST From: tve (Thorsten von Eicken) Subject: lots of things missing in /sprite/lib/include/math.h !? For example: (taken from SunOS)
    #define M_LN2       0.69314718055994530942
    #define M_PI        3.14159265358979323846
    #define M_SQRT2     1.41421356237309504880
    #define M_E         2.7182818284590452354
    #define M_LOG2E     1.4426950408889634074
    #define M_LOG10E    0.43429448190325182765
    #define M_LN10      2.30258509299404568402
    #define M_PI_2      1.57079632679489661923
    #define M_PI_4      0.78539816339744830962
    #define M_1_PI      0.31830988618379067154
    #define M_2_PI      0.63661977236758134308
    #define M_2_SQRTPI  1.12837916709551257390
    #define M_SQRT1_2   0.70710678118654752440
none of which exists in sprite! And how 'bout the whole business of providing exception handling?

945. Date: Thu, 15 Feb 90 17:11:08 PST From: sequent!fubar@uunet.uu.net Subject: Anti-social behavior in shutdown While testing my implementation of Mach_MonAbort & friends, shutdown displayed the following unfriendly behavior:
    yads: shutdown -hlep
    Unknown option "-hlep"; type "shutdown -help" for information
    00: Waiting with 1 user processes still alive
    00: Waiting with 10 kernel processes still alive
    00: Main exiting
    00: Rpc_Daemon exiting.
    00: Proc_ServerProc exiting.
    00: Proc_ServerProc exiting.
    00: Proc_ServerProc exiting.
    00: Proc_ServerProc exiting.
    00: Proc_ServerProc exiting.
    00: Recov_Proc exiting.
    00: Syncing disks
    00: Returning to firmware...
    Flush caches
    Date 90/02/16 00:52:51 UTC
* Shutdown shouldn't actually do anything if there are bad options.

946.
Date: Thu, 15 Feb 90 22:28:42 PST From: shirriff (Ken Shirriff) Subject: Re: Anti-social behavior in shutdown The problem with shutdown is actually a problem with Opt_Parse. As far as I can tell, if Opt_Parse doesn't like an argument it prints a message to stderr and skips it, and there's no way for the program to tell there's a problem. I think Opt_Parse should return an error in this case. 947. Date: Fri, 16 Feb 90 08:23:24 PST From: ouster (John Ousterhout) Subject: Re: .mk files need to be cleaned up In response to Ken's note: As a possible spring cleaning activity, the /sprite/lib/pmake/*.mk files need to be cleaned up, because there are a bunch of fixes that have been made to some, but not all of the files. For instance, boot.mk didn't have the changes for ds3100 .s files. I don't think this should wait for spring cleaning. Can whoever added the change for ".s" files check to make sure those changes are in all of the .mk files including boot.mk? If there are other things that aren't uniformly applied to the .mk files, let's get them fixed too. Just to refresh everyone's memory on the purpose of spring cleaning, it's not to fix random small bugs that accumulate: those should be fixed when found. Spring cleaning is for larger things that don't make sense as part of someone's research or as part of everyday bug fixing. If we allowed bugs to be deferred until spring cleaning then we'd quickly reach a state where no bugs were fixed except during spring cleaning. 948. Date: Fri, 16 Feb 90 08:41:39 PST From: Fred Douglis <douglis> Subject: Re: Log problem [John reported a bug sending mail to sprite log, with a complaint that logger couldn't create /tmp/sh-something-or-other.] The problem is that the file in /tmp already existed. This is partly an artifact of not cleaning up /tmp on reboots. 
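A minimal sketch of one way a program can avoid tripping over a stale file left in /tmp, assuming a POSIX C library that provides mkstemp(). Nothing here is taken from logger or sh: the CreateTempFile helper and the "/tmp/sh-XXXXXX" template are hypothetical. mkstemp() picks a name that does not exist yet and creates the file exclusively, so a leftover from before a reboot does not make the create fail.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a uniquely named temporary file and return its descriptor,
 * or -1 on failure. The chosen path is copied into pathBuf.
 * The template below is illustrative only.
 */
static int CreateTempFile(char *pathBuf, size_t bufSize)
{
    const char *tmpl = "/tmp/sh-XXXXXX";

    if (bufSize < strlen(tmpl) + 1) {
        return -1;              /* caller's buffer is too small */
    }
    strcpy(pathBuf, tmpl);
    /* mkstemp() fills in the X's and opens the new file exclusively. */
    return mkstemp(pathBuf);
}
```

The caller is expected to close() the descriptor and unlink() the path when done; otherwise files still pile up in /tmp, just under different names.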
I renew my suggestion of many months/years ago that we change tmp to be /hosts/%HOST/tmp, clean that directory on boottime, and have a separate shared /sprite/tmp along the lines of /usr/tmp. This would require that HOST be treated like MACHINE in pathname lookups, which would simplify a lot of other stuff too. 949. Date: Fri, 16 Feb 90 10:17:01 PST From: sullivan (Mark Sullivan) Subject: rcp This is new. babylon<3> !rcp rcp shangri-la:.login . C0644 673 .login No it did not copy the file. 950. Date: Fri, 16 Feb 90 10:48:28 PST From: eklee (Edward K. Lee) Subject: possible mail bug Here's the header from one of the messages I received today/yesterday. The 'From' line is incorrect; it should read rlee@island.seas.ucla.edu. I've moved the mail file to /sprite/spool/mail/eklee.bak. Also, any ideas why it took the mail two days to reach me? Ed ----cut-- >From rlee@island.Berkeley.EDU Thu Feb 15 20:48:58 1990 Date: Wed, 14 Feb 90 10:46:40 PST From: Robert Lee <rlee@island.Berkeley.EDU> To: eklee@ernie.Berkeley.EDU Subject: Hi Ed 951. Date: Fri, 16 Feb 90 13:35:56 PST From: Fred Douglis <douglis> Subject: cc1.68k went into debugger unfortunately there was no unstripped version to debug. i think it was subsequent to eviction. is cc1.68k compiled with hardware floating point? If so, the same bug mendel found on sun4s and ds3100s could exist with sun3s (though compiling his program on sun3s with -m68881 didn't demonstrate the bug.) in any case, is there any reason not to leave around unstripped versions of commonly used programs such as cc*? 952. Date: Fri, 16 Feb 90 14:08:52 PST From: Fred Douglis <douglis> Subject: pdev master screwups for the past several minutes, mint thought there was a master associated with /hosts/kvetching/migd.pdev, while kvetching did not. stats of that file returned an error. opening the file as a master got an error from mint saying the file was busy. 
unlinking the file on kvetching got a "nonexistent file" complaint, but i was able to unlink it successfully from another host, at which point i could then start the daemon successfully.

953. Date: Fri, 16 Feb 90 21:40:56 PST From: rab (Robert A. Bruce) Subject: new sparcStations Garth reported that he was having problems rsh'ing from apathy to other sparcStations. So I tried it from sabotage, and I am unable to rlogin to any of the new sparcStations. The particular error message seems to depend on what machine I attempt to login from.
When I attempt to login from another sparcStation, or a sun3:
    tyranny.Berkeley.EDU: address already in use
When I attempt to login from a ds3100:
    tyranny.Berkeley.EDU: unknown error (0)
When I attempt to login from unix:
    Protocol error, tyranny.Berkeley.EDU closed connection
I get the same result from all the new sparcStations.

954. Date: Sat, 17 Feb 90 17:06:54 PST From: rab (Robert A. Bruce) Subject: oregano /c started hanging. I ran rpcstat on oregano and saved the output in ~/oregano.rpc. I tried to kill and restart the daemons, but that didn't help. When I ran shutdown, oregano went into the debugger:
    Fatal Error: HandleRelease, handle <1,38,3,2> "/c" not locked.
    Entering Debugger with a FPU Inexact Result exception at PC 0xe0639cc
The instruction at 0xe0639cc is not a fp instruction, and the fpu is not set up to trap on inexact results, so I don't think that was the real reason for the trap.

955. Date: Sun, 18 Feb 90 14:08:39 PST From: gibson (Garth Gibson) Subject: sparcstation gremlin gremlin seems to be broken on the sparcstation (apathy): as it flashes, the image is shifted about an inch to the left and Xor'd over the prior image - interesting results, but limited usefulness

956. Date: Mon, 19 Feb 90 14:54:43 PST From: Fred Douglis <douglis> Subject: Re: find actually, you were partly correct. adding -print was necessary, but so was specifying a file rather than a link.
~choi is actually /users/choi, which is a link to something else. find doesn't traverse symbolic links (thankfully), which is why "find ~ -name foo -print" wouldn't work either (not so thankfully). i'm not quite sure what the best way is to address this problem. having a fixed /users directory is really useful, but it has hidden side effects such as this one. spriters: is this worth talking about on wednesday? ron: do "find ~/. ..." and it should work fine.

957. Date: Mon, 19 Feb 90 15:37:13 PST From: Fred Douglis <douglis> Subject: another sun3 program with ld problem i installed a new version of update a few days ago. turns out it wouldn't run on a sun3, or at least not fenugreek just now. i relinked it on fenugreek and it worked okay. i've installed the new version.

958. Date: Tue, 20 Feb 90 10:39:44 PST From: mendel (Mendel Rosenblum) Subject: sun4 floating point problem. Signal handlers that do floating point operations don't work on the sun4 because floating point state is not saved when a user's signal handler is called. If the signal handler uses the FPU it will trash the FP registers active at the time of the signal. The ds3100 and the sun3 seem not to have this problem.

959. Date: Tue, 20 Feb 90 12:07:22 PST From: mendel (Mendel Rosenblum) Subject: sun4 migration bug Migrating programs when nextPc != pc+4 doesn't work on the sun4 because Proc_ResumeMigProc calls Mach_StartUserProc which always sets nextPc to pc+4. If a process is unlucky enough to take a timer interrupt in the delay slot of a branch instruction, the branch will not be taken when execution is resumed after the migration.

960. Date: Tue, 20 Feb 90 17:58:40 PST From: Fred Douglis <douglis> Subject: bad interaction between swap space & debug on a sun4c paging from /sprite, when /sprite filled up, "debug foo" would hang in a totally unkillable state. apparently, even when the space freed up, those processes were wedged big time.
when the machine was eventually shut down, those processes wouldn't die.

961. Date: Tue, 20 Feb 90 21:45:48 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: machTypeMips.c The file /sprite/src/kernel/proc/machTypeMips.c does a #define mips, even though the file is not being compiled for a mips machine. This is potentially very bad. Right now it screws up Brent's addition of the FMT_MY_FORMAT constant to fmt.h. I don't think this is a problem since that constant is not used in the file, but I think it is bad practice anyway.

962. Date: Tue, 20 Feb 90 22:26:31 PST From: pmchen (Peter M. Chen) Subject: reserved instruction I get a reserved instruction error running on garlic (ds3100).
    Reserved instruction in process f3a0c at pc=402974
This was running kernel 1.058. The program does use floating point, but I thought this bug was gone. ps. John H., any word on the ds3100 crash that crashed after 112 runs (when the screen blanked, etc.)?

963. Date: Tue, 20 Feb 90 23:58:47 PST From: rab (Robert A. Bruce) Subject: Missing links in directory The directory /users/gibson/RAID/sim.RELI/RCS/ has no links for `.' or `..'.

964. Date: Wed, 21 Feb 90 02:04:21 PST From: tve (Thorsten von Eicken) Subject: migd problem Fred, I think your migd just gave up. I ran pmake.new, then I saw an "RPC to terrorism hung" message on the console and now the pmake hangs forever. I tried to rlogin again (working from home...) and got the login but then everything hangs (regardless of which machine I log into). Now I have a "la" in my .login which doesn't arrange things. I finally got rid of that and now things are ok, except for what's hung. All this happened at about 2am.

965. Date: Wed, 21 Feb 90 07:50:10 PST From: mgbaker (Mary Gray Baker) Subject: trashed mail file? I had two messages as one in my mail file:
******************************************************************************
>From rab Tue Feb 20 23:59:01 1990
From: rab (Robert A.
Bruce)
To: bugs
Cc: rab, gibson
Subject: Missing links in directory
Date: Tue, 20 Feb 90 23:58:47 PST

The directory /users/gibson/RAID/sim.RELI/RCS/ has no links for `.' or `..'.
-bob
Received: by sprite.Berkeley.EDU (5.59/1.29) id AA998971; Wed, 21 Feb 90 00:16:16 PST
Date: Wed, 21 Feb 90 00:16:16 PST
From: elm (ethan miller)
Message-Id: <9002210816.AA998971@sprite.Berkeley.EDU>
To: mgbaker
Subject: Re: joyride

No problem. Actually, joyride is not my machine; it belongs to either Rich Drewes or Pete Leong, who are using those desks.
ethan

966. Date: Wed, 21 Feb 90 08:39:32 PST From: Fred Douglis <douglis> Subject: Re: migd problem I did a ps on terrorism and both migds were in the SUSP state! I have no idea why this happened -- I'll add something to ignore the signal -- but continuing them seems to have cleared things up.

967. Date: Wed, 21 Feb 90 09:31:02 PST From: ouster (John Ousterhout) Subject: Missing syslog's When I attempted to run wall today I noticed that neither tyranny nor sedition has a syslog device in its host-specific directory. Does this mean that the script to add a host isn't creating them automatically?

968. Date: Wed, 21 Feb 90 12:38:55 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: /sprite/admin/hosts This file is a little out of date. We should be sure to update it when machines change physical location and "owners". I also noticed that the entry for sage is wrong. If addhost was used to add the new sparcstation under the name sage then there is a bug. If addhost wasn't used then it should have been.

969. Date: Wed, 21 Feb 90 13:14:45 PST From: mendel (Mendel Rosenblum) Subject: rint() library routine broken The rint() math library routine doesn't work on sun3s and sun4s running Sprite. It does work on the ds3100.
An example program is:

    #include <stdio.h>
    #include <math.h>

    int main()
    {
        double in, out;

        in = 2040109464.65;
        out = rint(in);     /* round to an integral value (nearest, in the default rounding mode) */
        printf("%f %f\n", in, out);
        return 0;
    }

which prints out:
    2040109464.650000 2040109464.000000
on sun3s and sun4s and should print out:
    2040109464.650000 2040109465.000000

970. Date: Thu, 22 Feb 90 09:20:15 PST From: ouster (John Ousterhout) Subject: Mint reboot When I came in this morning, Mint did not respond to rlogin connections. I went upstairs and restarted the daemons, but the "restartservers" script hung without completing. At that point I gave up and rebooted Mint. However, it's possible that all the problems were due to csgw's being detached from the network, so perhaps the rebooting was unnecessary.

971. Date: Thu, 22 Feb 90 10:48:41 PST From: brent (Brent Welch) Subject: Re: Mint reboot Right, the bootp program goes into the debugger if the gethostbyname() call fails. It checks against a return of -1 instead of 0, and gets a bus error. It seems goofy that gethostbyname() fails. Shouldn't we patch it to consult /etc/spritehosts (doesn't the Host_ library use /etc/spritehosts?). This causes mint's boot script to fail. I had mint up and limping last night during the partition, but I'm glad you rebooted it.

972. Date: Thu, 22 Feb 90 11:55:58 PST From: ouster (John Ousterhout) Subject: Tftp enabled in inetd.conf? Some of our replicated inetd.conf files have the "tftp" line enabled and some have it commented out. I think it needs to be commented out everywhere: if tftpd is needed it's started in the bootcmds file explicitly, right? I had troubles at DEC with this and couldn't get diskless machines to boot if inetd was dealing with tftp requests. Things worked when I modified inetd.conf to have the tftpd line commented out. Does anyone know anything about this? -John-

973. Date: Thu, 22 Feb 90 15:57:20 PST From: mgbaker (Mary Gray Baker) Subject: Getting wrong TM type by default All of a sudden today I'm getting TM=sun3 defined while compiling on treason.
I thus accidentally overwrote a command in the sun3 area since "pmake installsun4" decided to do a TM=sun3 instead. Has somebody changed something? 974. Date: Thu, 22 Feb 90 23:33:07 PST From: Fred Douglis <douglis> Subject: migration fixes i've installed a new mach for the ds3100 that i believe fixes the floating point migration bug. it doesn't fix the "error code NN" problem we see with cc -- i don't believe we've come up with a satisfactory solution to that, except that maybe moving to gcc and compiling with a sprite-native library will retry calls that need to be retried -- but it should fix the problem of having inconsistent results. i've also fixed the floating point problem for the sun4c.. in each case it was a question of (1) getting the FPU registers and (2) setting the FPU status bit to indicate they should be restored. while i was at it i put in a check for a non-sequential PC, requiring that I put an extra check in sun4.md/machCode.c for the SIG_MIGRATE_TRAP bit being set after no more signals are officially pending. (On the sun3 this makes a function call on every special handling return, and it could just as easily be a macro.. but i guess the new machines are the ones that count. the sun4 didn't enable it at all.) for the latter fix, i tried migrating something that was just bouncing around a lot between goto's, subroutine calls and assignment statements. it doesn't migrate successfully, consistently, but it's not exactly a repeatable test case either. i'll keep at it. now i think the only remaining FPU problem is the sun4 signal handlers, right? i guess i won't install mach for the sun4 until that's also there, unless mary thinks i shouldn't wait... 975. Date: Fri, 23 Feb 90 11:40:11 PST From: Fred Douglis <douglis> Subject: rpc timeouts foil migd when i shut down the server running migd, other hosts get RPC timeouts and their migd daemons freeze up until recovery. 
from a glance at the source code for Fs_Write, making the writes non-blocking won't help, since it doesn't look at that. any suggestions? certainly there are programs that don't want to know about timeouts, and where declaring a host dead immediately and returning an error wouldn't be doing anyone any favors. in this case, though, i need more support for that. maybe an iocontrol on the stream to say not to wait for recovery? i'm sure brent will have some ideas on this when he gets back... and/or we can discuss this as an agenda item on monday.

976. Date: Fri, 23 Feb 90 15:58:46 PST From: douglis@dill (Fred Douglis) Subject: grim network deadlocks; fscheck debug problem /etc/spritehosts got locked up this afternoon. mint wanted to tell treason to return the attributes for this file, but treason thought the file was locked and the callback was blocked waiting for it to be unlocked. when treason recovered with mint a little before that, it skipped spritehosts because it thought it was locked. mint then told kvetching to flush back the attributes, but this timed out because kvetching died with a garbaged stack and/or garbaged heap (it was in a panic from free() and the backtrace wasn't entirely sensical.) by the way, for the record, the earlier catastrophe started when allspice ran out of pmegs, since it didn't reduce its fs cache size when it was rebooted during the network meltdown and didn't make it through bootcmds. then upon reboot, two fscheck processes went into the debugger. it is very very bad that (1) these can go into the debugger when run in foreground since there's no way to get a prompt or an rlogin to debug them. maybe fscheck should catch the signal, report an error and cause bootcmds to bail to a single-user shell? jhh rebooted allspice single user to run fscheck by hand in the background, and this time it finished fine. exiting the su shell caused us to run fscheck yet again, and allspice eventually came up.
i was gone then but understand it, and mint, died pretty horribly after that -- someone else can send mail if they know the details. 977. Date: Fri, 23 Feb 90 16:10:21 PST From: Fred Douglis <douglis> Subject: fs/recov bug larceny crashed right after recovery with mint earlier today. its syslog indicated that /hosts/larceny/migInfo.new was skipped during recovery, whatever exactly that means. Then the panic was GetDirtyBlock, bad block because blockPtr->fileNum was 2576 and cacheInfoPtr->hdrPtr->fileID.minor was -2576. 978. Date: Fri, 23 Feb 90 17:03:04 PST From: Fred Douglis <douglis> Subject: problem with new sun3 compiler? I found that all of a sudden, migd on sun3s was listing the load as the most recent queue length rather than aging it using a weighted average. i saw that the weights were an array of floats all set to 0, apparently because they weren't double-aligned. i tried loading with /sprite/cmds.sun3.old/ld and /sprite/src/cmds/ld.old/sun3.md/ld, and nothing worked. this program runs just fine on ds3100s and sun4s, just not on sun3s. i can tell it's not the program screwing up because gdb on the binary, without running, prints the array as three zeros despite the code #define WEIGHT1 0.9200444146293232 /* exp(-1/12) */ #define WEIGHT2 0.9834714538216174 /* exp(-1/60) */ #define WEIGHT3 0.9944598480048967 /* exp(-1/180) */ double migd_Weights[] = {WEIGHT1, WEIGHT2, WEIGHT3}; loading with the sun4 didn't help. recompiling from the sun4 did. (and the linked image even executes on fenugreek, and nextAddr isn't 0. son of a gun. but when i find another executable that has that, bob, i'll let you know...). 979. Date: Fri, 23 Feb 90 17:56:57 PST From: douglis (Fred Douglis) Subject: recovery is the pits (and so is sprite right now!) Keywords: recovery bugs 1) each machine probably just recovered with mint at least 10 times. 
i asked mary what it would take to fix the recovery storm problem, and she said it would be easy to fix but hard to then measure the effects of various changes. how about a mean nasty kernel that gets recovery storms that can be booted for testing purposes and a nice clean kernel that doesn't get recovery storms and lets people get work done? 2) larceny bit the big one yet again due to recovery. this time, an unaligned address trap in the kernel, right after recovery. seems like recovery is trashing memory. working on sprite today has been a miserable experience. lately it seems to be worse than it was when john first decided things were bad enough to offer a dinner for more stability. what next? 980. Date: Sat, 24 Feb 90 06:09:59 PST From: rab (Robert A. Bruce) Subject: assault ran out of memory Assault crashed a few minutes ago. It said `Vm_RawAlloc out of memory'. Here is the stack trace: 0 .block549 ["sysPrintf.c":209, 0x800b8ed0] 1 panic(va_alist = -2146494756) ["sysPrintf.c":209, 0x800b8ed0] 2 Vm_RawAlloc(numBytes = -1073288280) ["vmSubr.c":254, 0x800ca588] 3 MemChunkAlloc(size = 384, addressPtr = 0xc0803ee8) ["memSubr.c":94, 0x8008ebc8] 4 .block366 ["memory.c":508, 0x8008ef44] 5 malloc(numBytes = -1073292840) ["memory.c":508, 0x8008ef44] 6 .block316 ["fsSpriteDomain.c":420, 0x8007f538] 7 Fsrmt_RpcOpen(srvToken = 0xc006eba8, clientID = 17, command = 7, storagePtr = (nil)) ["fsSpriteDomain.c":420, 0x8007f538] 8 .block489 ["rpcServer.c":199, 0x800ae55c] 9 Rpc_Server() ["rpcServer.c":199, 0x800ae55c] 10 .block505 ["schedule.c":944, 0x800b2714] 11 Sched_StartKernProc(func = 0x800ae1a0) ["schedule.c":944, 0x800b2714] 12 Sched_StartKernProc(func = [bad address (0xc0804008)]) ["schedule.c":914, 0x800b268c] 981. Date: Sat, 24 Feb 90 09:50:10 PST From: ouster (John Ousterhout) Subject: Tyranny crash When I came in this morning Tyranny was looping infinitely printing messages on the console and apparently trying to enter the debugger.
The messages reported a deadlock on "Proc:serverMutex @ 0xf612cee8". The holder PC was 0xf60752f4 and the current PC was 0xf6075294. Both of these addresses correspond to the MASTER_LOCK at the beginning of CallFunc in procServer.c. 982. Date: Sun, 25 Feb 90 15:22:20 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: loadavg bug I find it hard to believe that mint has been up for over two months. mint sun3 up 60+23:14 refuses 0.94 0.72 0.54 (0+03:08) 983. Date: Sun, 25 Feb 90 15:47:12 PST From: Fred Douglis <douglis> Subject: Re: loadavg bug there is no way for user programs to find out the boot time so loadavg uses the date of /hosts/%HOST/boottime. when mint booted it couldn't find out the current time and did not use a reasonable default (0 == 1/1/70.) can't it look at its own TOD clock?? 984. Date: Mon, 26 Feb 90 10:11:02 PST From: ouster (John Ousterhout) Subject: ar broken for sun4 The sun4 version of ar seems to be broken. If I type "pmake debug" in the Tcl library (running on a SPARCstation), it compiles all the .go files, but the "ar" line seems to do nothing: it doesn't create a sun4.md/libtcl_g.a file. I also noticed that typing "ar tv" on libtcl_g.a (back when the file existed) caused ar to dump core (this was why I deleted it and started recompiling everything). I see that ar was created last night by rab, and I know that it worked OK on Saturday. I also see that there is no backup version of ar in /sprite/cmds.sun4.old. Shouldn't there *always* be a backup version of any program that has been installed recently? 985. Date: Mon, 26 Feb 90 11:22:37 PST From: Fred Douglis <douglis> Subject: assault died again it was identical to the bug bob reported, from what i could tell. i examined it in kdbx and it seems that in Fsrmt_RpcRead, paramsPtr is set properly initially (enough to get the right stream & hdr pointers) but then gets trashed, so it's no longer the same value as storagePtr->requestParamPtr. 
then Fsrmt_RpcRead calls malloc with the garbage pointed to by the bad paramsPtr. i don't see any obvious way for the stack to get clobbered this badly, but i wonder if maybe the pointer is being saved in a register that got clobbered and not restored during an interrupt. i noticed that each time that assault died with a garbage pointer it's been shortly after recovery with some client, if that's of interest. 986. Date: Mon, 26 Feb 90 11:47:27 PST From: pmchen (Peter M. Chen) Subject: xproof running with ditroff -Pxproof throws Xmfb into the debugger. For example, try cd ~pmchen/striping/simul/sigarch tbl -Plw508-5 camera | grn -Plw508-5 | eqn -Plw508-5 | ditroff -me -Pxproof 987. Date: Mon, 26 Feb 90 17:56:29 PST From: Fred Douglis <douglis> Subject: addHosts needs to update ginger ginger:/etc/hosts.equiv should have an entry for all sprite hosts since they may wish to access non-sprite printers. this probably can't be done automatically, but a message to this effect would be good. (there's already a statement to this effect in howto/addNewHost.) 988. Date: Mon, 26 Feb 90 23:40:56 PST From: shirriff (Ken Shirriff) Subject: dvips I got dvips working in /sprite/cmds/dvips, using Fred's makefile setup. For some reason the printer dies if the postscript contains the comment: %%EndProlog and works if I remove this. dvips worked on two test files and then died when I tried to print a third test file (leiden.dvi), which has a letterhead at the top. The old dvi2ps handles this. So my conclusion is that the new dvips doesn't work as well as the old dvi2ps. If anyone wants to figure out why dvips dies, they're welcome to try. 989. Date: Tue, 27 Feb 90 12:28:18 PST From: Fred Douglis <douglis> Subject: dvips still doesn't work i tried it on an extensive paper (the migration TOCS submission) and the printer chewed on it for many minutes and then gave up.
I found I had to remove some stuff in my TeX document that took advantage of something in the dvi2ps postscript profile that i couldn't get to work for dvips. now it prints normal text okay, but it complained: psif[c4946]: status: (Error: VMerror; OffendingCommand: def) psif[c4946]: Unrecognized status message: Error: VMerror; OffendingCommand: def at about the time it tried to include a postscript figure. does this mean it ran out of memory? btw, the ErrorLog had many entries of the form psif[1494e]: status: (status: waiting; source: serial 25) -- it turns out that stuff like this has caused lw477's spool directory to use more than a megabyte of space. 990. Date: Tue, 27 Feb 90 18:12:41 PST From: Fred Douglis <douglis> Subject: Re: dead migration daemon on terrorism at 6:09PM Tuesday. Is this normal (or at least directly attributable to what you're doing)? yup, it's (sort of) my fault. the daemon was running on a host with a slow clock, confusing things. i did an rdate and it caused the daemon to think all the hosts were down since their timestamps were old. there was a bug that caused the daemon to return an error to the other daemons, and i hope i've fixed that now. i'll restart the daemons any moment. i really wish sprite had adjtime().... 991. Date: Tue, 27 Feb 90 21:47:50 PST From: shirriff (Ken Shirriff) Subject: Strange sendmail bug I had 4 sendmail processes on nutmeg (sun3) go into the debugger. The problem was the data segment from 0x40000 to 0x42000 had been zeroed for some reason. 992. Date: Wed, 28 Feb 90 00:59:07 PST From: tve (Thorsten von Eicken) Subject: /sprite/lib/sendmail/aliases & RCS What's the deal here? The file is world writable (how nice...) and I can't check it in/out (ci error: Directory RCS/ not writable). Is there some policy here? Thorsten 993. Date: Wed, 28 Feb 90 11:47:45 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: tx = If you type 'tx =' tx will go into an infinite loop. 994. 
Date: Wed, 28 Feb 90 13:55:50 PST From: brent (Brent Welch) Subject: migrated rcp => permission denied If I migrate an rcp from a host that IS in a remote .rhosts file to a host that IS NOT in the remote .rhosts file, I get 'permission denied'. My makefile for ds3100 kernels, for example, does an rcp of the kernel image to dill. If this is migrated to a host that isn't in my .rhosts file on dill, then the rcp (or "rsh date") aborts with "permission denied". I thought that the ipServer on the home node was used... 995. Date: Wed, 28 Feb 90 14:24:18 PST From: mendel (Mendel Rosenblum) Subject: FPU interrupt in Kernel mode When I run my simulator on piquante it sometimes dies with an Illegal Instruction trap with a syslog message of "FPU interrupt in Kernel mode". Anyone interested in this? 996. Date: Wed, 28 Feb 90 14:33:07 PST From: ouster (John Ousterhout) Subject: Piracy in debugger Piracy is in the debugger with a "Reserved instruction" trap in the kernel. Anyone care to take a look? John H.? I'll leave it around for a few days. Let me know if you get done with it and I'll reboot it. 997. Date: Wed, 28 Feb 90 16:21:54 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: tx bug If the user doesn't have write permission to /hosts/<host>, then tx goes into the debugger. It tries to print out an error message that it can't open the pseudo-device, but the name parameter is bogus at that point. 998. Date: Wed, 28 Feb 90 17:34:48 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: update bug The update program does not set permissions correctly. In particular, if you have a non-setuid file, then the permissions of the destination file are set to the permissions of the source file modified by the current user's umask. This means when I updated all the stuff over in Cory everything is not writable by world. I will fix update unless this isn't a bug. 999. 
Date: Wed, 28 Feb 90 17:52:44 PST From: mendel (Mendel Rosenblum) Subject: cc Exit 1 on ds3100 If you control-Z and background a running "cc" command on the ds3100 it will exit with a status of 1. For example: piquante% cc -c -O0 t2.c ^Z Suspended piquante% bg [2] cc -c -O0 t2.c & piquante% [2] Exit 1 cc -c -O0 t2.c 1000. Date: Wed, 28 Feb 90 18:17:18 PST From: Fred Douglis <douglis> Subject: Re: migrated rcp => permission denied this was a case of bitrot. even as of january, rcp was apparently being linked with a version of gethostname that would return the physical host rather than the virtual host. (maybe this is something that used to work, then things got changed to use a system call and it was broken for a while, and then it worked again?) in any case, i'm installing new copies of rcp -- please let me know if this doesn't fix your problem. 1001. Date: Thu, 01 Mar 90 11:07:20 PST From: Fred Douglis <douglis> Subject: host timeouts the ConsistTimeout stuff seems to go through a full rpc timeout for every file rather than deciding that a host is down and then marking everything as available immediately. this causes things to hang (rpc to mint is hung.... a minute later RPC ok, and immediately another hung rpc). perhaps this would be fixed if we finally move to marking hosts as crashed as soon as an RPC times out. should we? 1002. Date: Thu, 1 Mar 90 14:01:56 PST From: pmchen (Peter M. Chen) Subject: la hangs On mustard (ds3100), "la" hangs in the WAIT state. 
No message to the syslog, though some previous messages on the syslog were: 3/1/90 12:24:21 larceny (73) RmtPdev "/sprite/admin/migInfo.pdev" <2097153,-1976963045> Reopen failed : cacheable/busy conflict 3/1/90 12:24:21 larceny (73) RmtPdev "/sprite/admin/migInfo.pdev" <2097153,-1976962679> Reopen failed : cacheable/busy conflict 3/1/90 12:24:21 larceny (73) Recovery failed: cacheable/busy conflict <12>Mar 1 12:24:21 syslog: Mig_GetAllInfo: error during ioctl to global master: the file handle is out of date <14>Mar 1 12:24:44 syslog: DbLockDesc: lock timed out (file /sprite/admin/userLog). 1003. Date: Thu, 1 Mar 90 16:45:48 PST From: pmchen (Peter M. Chen) Subject: migd dies In ~pmchen/striping/simulsabre, I pmake -J 30 and larceny died, which killed my migd daemon. Here's my syslog: RPC srvr 52c3e RPC srvr d2c15 RPC srvr 72c31 RPC srvr 72c37 RPC srvr 82c35 <write> 3/1/90 16:42:24 larceny (73) RPC timed-out <30>Mar 1 16:42:29 migd[12c19]: Write to global daemon timed out. <close> 3/1/90 16:42:35 larceny (73) RPC timed-out <dev open> 3/1/90 16:42:40 larceny (73) RPC timed-out <dev open> 3/1/90 16:42:48 larceny (73) RPC timed-out 3/1/90 16:42:55 larceny (73) - recovering handles 3/1/90 16:42:55 larceny (73) RmtPdev "/sprite/admin/migInfo.pdev" <2097153,-1976962201> Reopen failed : cacheable/busy conflict 3/1/90 16:42:55 larceny (73) RmtPdev "/sprite/admin/migInfo.pdev" <2097153,-1976962056> Reopen failed : cacheable/busy conflict 3/1/90 16:42:55 larceny (73) Recovery failed: cacheable/busy conflict <12>Mar 1 16:44:26 syslog: No migd daemon running on your host. 1004. Date: Thu, 01 Mar 90 12:57:39 PST From: sequent!fubar@uunet.UU.NET Subject: Problem with arp? 
/sprite/src/daemons/arp/arp.c contains (around line 163): if ((Net_NetToHostShort(packet.protocolType) != NET_ETHER_IP) || (packet.opcode != NET_ARP_REQUEST)) { continue; } I suspect this should be: if ((Net_NetToHostShort(packet.protocolType) != NET_ETHER_IP) || (Net_NetToHostShort(packet.opcode) != NET_ARP_REQUEST)) { continue; } Note the ntohs translation for packet.opcode. The original (I guess) would be ok on Suns, since their byte order is the same as network byte order (whichever that one is, I can never remember). On the i386, the "packet.opcode" field comes in set to 0x100, this is NET_ARP_REQUEST in network byte order. 1005. Date: Thu, 1 Mar 90 23:56:28 PST From: shirriff (Ken Shirriff) Subject: sage crashed I tried to do a pmake on sage (sun4c.BW.243) and it crashed with MachHandleWeirdoInstruction: the error occurred in a user process. MachHandleWeirdoInstruction: unaligned address trap in the kernel! The stack trace is: #0 panic (__builtin_va_alist=-167711087) (sysPrintf.c line 209) #1 0xf600f308 in MachHandleWeirdoInstruction (trapType=112, pcValue=(char *) 0xf6093c2c "\320\004", trapPsr=4194501) (sun4c.md/machCode.c line 1529) #2 0xf6010810 in MachReturnFromTrap () #3 0xf6093c2c in PreparePage (virtAddrPtr=(Vm_VirtAddr *) 0xf82dbe40, protFault=0, curPTEPtr=(unsigned int *) 0xfb) (vmPage.c line 1629) #4 0xf6093874 in Vm_PageIn (virtAddr=(char *) 0x1dfff7c8 Apparently the problem is that somewhere in Vm_PageIn, the value of page changed from 122879 to -131219904, which caused curPtePtr to be 0xfb, a bogus pointer which caused it to die. The machine is in the debugger if any sun4c experts wish to take a look. 1006. Date: Fri, 02 Mar 90 09:43:11 PST From: rab (Robert A. Bruce) Subject: trashed file /sprite/lib/include/assert.h is trashed. It has part of a mail message from sequent!fubar@uunet.UU.NET appended to it. The modification time on the file is Feb 20 21:41, but the mail message was sent Thu Mar 1 20:48:24 1990. 
Something very interesting is that assert.h was trashed once before (last October) in *exactly* the same place. I moved the trashed file to /sprite/trashed/assert.h.trashed. 1007. Date: Fri, 02 Mar 90 11:20:29 PST From: Fred Douglis <douglis> Subject: piquante is still sick it crashes regularly. the latest one was "bus error on ifetch". doesn't sound good. i think the hardware is bad and the whole machine should be taken out and shot.... :) 1008. Date: Fri, 2 Mar 90 11:24:32 PST From: shirriff (Ken Shirriff) Subject: Re: sage crashed Yes, the code is like: page = transVirtAddr.page; ( where transVirtAddr.page is valid ) ... ptePtr = VmGetAddrPTEPtr(&transVirtAddr, page) Later it crashes because ptePtr is garbage. Page is garbage at this point, so presumably it was garbage at the above line, which caused ptePtr to be invalid. But on second thought, unless it was compiled with the CLEAN flag, VmGetAddrPTEPtr should have checked that page was valid, so maybe my theory is wrong. 1009. Date: Fri, 2 Mar 90 13:59:24 PST From: brent (Brent Welch) Subject: Assault crashed and I figured it out Assault didn't actually crash, but it was left with a process that had a huge stack segment and was unkillable because of a deadlock. I figured it out - had to do with error handling that I added to handle disk full conditions. I had changed things so a pagein was aborted after a segment had gone bad, but I didn't clean up everything associated with the segment. (VmParseVirtAddr has side-effects I was unaware of.) This fix will appear in 1.060 1010. 
Date: Fri, 02 Mar 90 11:41:49 PST From: sequent!fubar@uunet.UU.NET Subject: More arp trouble, plus bonus typedef problem In /sprite/src/daemons/arp/arp.c (around line 115): if (!Lookup(Net_NetToHostInt(packet.targetProtAddr), &etherAddrPtr)) { continue; } The "Net_NetToHostInt" here is wrong; the inet_addr() function in the inet library (where the internet addresses come from that are put into arp's hash table) returns inet addresses in network order; having this extra conversion here before the lookup causes arp not to function on machines with different byte order than network order. Also, the typedef for Net_InetAddress (to an unsigned int) causes the Net_ArpPacket structure to be misaligned, since longwords are 4-byte aligned. Net_InetAddress should "really" be a u_char[4], but this causes trouble for functions that wish to return type Net_InetAddress. Making Net_InetAddress a struct with four u_char entries might fix the alignment problem, but might not (I've already had to define NET_ETHER_BAD_ALIGNMENT, since the compiler does 4-byte alignment after the "real" Net_EtherHdr structure). To get arp to work for the time being, I've made the *ProtAddr fields of Net_ArpPacket into char[4]'s, and in arp all references to them look like "*((int *)packet.senderProtAddr)." Pretty nasty, but it does work. 1011. Date: Sat, 3 Mar 90 03:30:53 PST From: tve (Thorsten von Eicken) Subject: libc/gnulib problem: no __builtin_new for sun4 (only for sun3) Turns out g++ uses that. 1012. Date: Sat, 3 Mar 90 10:36:57 PST From: mendel (Mendel Rosenblum) Subject: new pmake bug Pmake doesn't work correctly if invoked as "make". All commands exit with an error code of 1. For example: jaywalk% make cc -c f.c *** Error code 1 Stop. jaywalk% If I use "pmake" or "make -X" it works fine. It appears to be independent of the host selected and happens on both the sun4c and ds3100.
When I typed "make -d" the last couple of lines were: f.o:< = f.c Examining f.c...modified 14:54:02 Feb 28, 1990...up-to-date. Examining f.o...non-existent...modified before source...out-of-date. f.o:> = f.c f.o:? = f.c cc -c f.c Rmt_Begin: selected host 63 for migration. Error in Proc_RemoteExec: the file does not exist *** Error code 1 Stop. 1013. Date: Sat, 3 Mar 90 12:09:29 PST From: tve (Thorsten von Eicken) Subject: pmake on sun3s?? -> segmentation violation If I type make or pmake or make -X in /mic/g++/src/libg++.dist/tests on a sun3 I get a Segmentation violation. 1014. Date: Sat, 3 Mar 90 13:39:26 PST From: pmchen@envy-150.Berkeley.EDU (Peter M. Chen) Subject: network problems? I was on mustard (ds3100) remotely from envy when things got really slow. So, I logged out and tried to get to allspice (rlogin) to see if everything was ok. That hung, so I tried assault, which hung. Then I pinged allspice and assault, which hung. Then I pinged oregano, which pinged back ok. I tried to rlogin to oregano, which hung. Then I repinged oregano, which *hung*. Is it possible that my rlogin's killed the ipServer (or some such), so that machines couldn't ping back? I pinged mustard successfully, tried to rlogin (unsuccessful), then re-pinged UNsuccessfully. 1015. Date: Sat, 3 Mar 90 15:53:33 PST From: mendel (Mendel Rosenblum) Subject: error message problem If I run a sun3 binary on a sun4c I get the following message in my syslog: Proc_Exec: can't run sun3 (0413) a.out file on 1016. Date: Sun, 4 Mar 90 01:33:29 PST From: tve (Thorsten von Eicken) Subject: these unsynched clocks are a pain! I'm fighting this all the time: 13 -rw-rw-r-- 1 tve mic 12510 Mar 4 01:30 stream.C 64 -rw-rw-r-- 1 tve mic 58305 Mar 4 1990 stream.o 9 -rw-r--r-- 1 tve mic 8900 Mar 4 01:30 streambuf.C 35 -rw-rw-r-- 1 tve mic 35582 Mar 4 01:32 streambuf.o In case you haven't encountered this (I doubt it): the funny date format for stream.o means "future".
Every time I edit stream.C, I have to delete stream.o before typing pmake. grrrrr. 1017. Date: Fri, 02 Mar 90 11:38:04 PST From: sequent!fubar@uunet.UU.NET Subject: More arp trouble, plus bonus typedef problem In /sprite/src/daemons/arp/arp.c (around line 115): if (!Lookup(Net_NetToHostInt(packet.targetProtAddr), &etherAddrPtr)) { continue; } The "Net_NetToHostInt" here is wrong; the inet_addr() function in the inet library (where the internet addresses come from that are put into arp's hash table) returns inet addresses in network order; having this extra conversion here before the lookup causes arp not to function on machines with different byte order than network order. Also, the typedef for Net_InetAddress (to an unsigned int) causes the Net_ArpPacket structure to be misaligned, since longwords are 4-byte aligned. Net_InetAddress should "really" be a u_char[4], but this causes trouble for functions that wish to return type Net_InetAddress. Making Net_InetAddress a struct with four u_char entries might fix the alignment problem, but might not (I've already had to define NET_ETHER_BAD_ALIGNMENT, since the compiler does 4-byte alignment after the "real" Net_EtherHdr structure). To get arp to work for the time being, I've made the *ProtAddr fields of Net_ArpPacket into char[4]'s, and in arp all references to them look like "*((int *)packet.senderProtAddr) 1018. Date: Sun, 4 Mar 90 11:25:57 PST From: mendel (Mendel Rosenblum) Subject: Rpc serverID too large error. When I came in this morning larceny and treason were in the debugger with the message "Rpc_Call, server ID too large". I rebooted larceny and was going to debug treason but it was running a kernel I couldn't find the symbols to: SPRITE VERSION MB.004 (sun4c) (1 Mar 90 12:06:24) I was able to poke around using an incorrect symbol table. It appears that someone was trying to migrate the command "echo foo bar xyz pdq" to host 303848.
The calling stack looked like: Proc_RemoteExec calls Proc_Exec calls ProcInitiateMigration calls ProcMigCommand calls Rpc_Call with host of 303848. The environment being passed to Proc_Exec looked like: TTY=/hosts/treason/rlogin2 TERM=z29 RHOST=kvetching.Berkeley.EDU MACHINE=sun4 RUSER=douglis USER=douglis HOME=/user2/douglis SHELL=/sprite/cmds/tcsh .... It seems like either the Rpc module should tolerate sends to bogus ids or the Proc module should validate user provided ids. I think the Rpc system should catch the problem. p.s. I put a patch for the problem so no more sparcStations will crash. The fix was to remove the account ("douglis") from /etc/passwd. 1019. Date: Mon, 05 Mar 90 10:10:56 PST From: Fred Douglis <douglis> Subject: sun4c deadlock again larceny died with a deadlock on Proc_Mutex and wouldn't enter the debugger or respond to L1 keys. 1020. Date: Mon, 5 Mar 90 10:14:48 PST From: mendel (Mendel Rosenblum) Subject: serious problem with allspice While checking its disk, allspice gets DMA Bus errors from its SCSI adaptors. This is very very bad. 1021. Date: Mon, 05 Mar 90 10:45:31 PST From: Fred Douglis <douglis> Subject: explanation of cc error code 1 problem regarding mendel's mail about suspending and resuming pmake, and earlier messages about migration also causing cc to return an error code of 1, i did the simplest test possible: % cc -o foo foo.c ^Z % fg % echo %status 1 (don't ask me why this didn't occur to me before... :) anyway, the following comment in machUNIXSyscall.c might explain the problem: * The routines in this file and the other associated files in this * directory (socket.c, ioctl.c, etc.) provide full binary compatible * with the following exceptions: * * 1) System call handlers are called with a Sprite context * rather than a UNIX context. 
This could be fixed by * setting a compatibility bit in the machine state struct * when the first UNIX system call happens and then emulating * UNIX signal handler calling conventions after that for * the process. * ... * 3) Reads that are interrupted do not restart like they * do in Sprite. The problem is that if we restart them * in here then the user never has the oppurtunity to * handle the signal. A possible solution is to return to * user mode and then when the signal handler returns restart * the system call. This will be complicated because the * arguments have to be kept around somehow. i presume the problem is related to (3) though it might also relate to (1). let's talk about how to deal with this (and when) at today's meeting. 1022. Date: Mon, 5 Mar 90 11:44:27 PST From: eklee (Edward K. Lee) Subject: diff seg faults on ds3100 Ed --- forgery% pwd /users/eklee/264/hw3/myparse3 forgery% diff . ../../hw2/myparse3 Only in .: dist Common subdirectories: ./lib and ../../hw2/myparse3/lib Only in .: symtbl.c diff: ./tc1: no such file or directory diff: ../../hw2/myparse3/tc1: no such file or directory Segmentation violation forgery% --- 1023. Date: Mon, 05 Mar 90 12:32:47 PST From: Fred Douglis <douglis> Subject: more evidence of register corruption on ds3100 my emacs periodically dies spontaneously. this time, when i debugged it, it was at a point where a "register" variable had a garbage value, yet two statements before it had set another variable to the value in the register and that value was fine. the only intervening statements were setting things pointed to by the pointer in the register. so, maybe it took a signal at the wrong time or something, or it took an interrupt in the kernel, but things weren't kosher. i'm mentioning this because it's a lot like what we see on assault when it dies with a bad value for malloc. 1024. Date: Mon, 5 Mar 90 12:40:46 PST From: shirriff (Ken Shirriff) Subject: tx clear under unix fixed. 
The previous bug report about clear seg faulting under tx on Unix can be deleted. The problem was that the clear program allocates a 20 character buffer for the clear string and doesn't check if the string fits. The tx clear string was 24 characters long and for some reason this killed clear on Unix but not Sprite. I shortened the tx clear string and now it works. 1025. Date: Mon, 05 Mar 90 13:47:04 PST From: Fred Douglis <douglis> Subject: address space grows without bound it just hit me that no one has reported this bug so far: this morning, allspice started hanging up badly, slowing down the system terribly. it turned out that an errant csh on gluttony was creating a massive address space (132MB or so) and was paging it all to allspice. seems to me this sort of thing has happened before. we really need the equivalent of "limit" on sprite to handle this problem. that, or at least handle such swap hogs more gracefully by degrading their priority or something? by the way, this was due to a bug in csh -- johnw's .login file had as its last line something like if (%?RHOST) setenv DISPLAY "%RHOST"\:0 and csh went bonkers. 1026. Date: Tue, 6 Mar 90 16:44:15 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: pmake broken Ken and I have been trying to run pmake (on fenugreek and thyme respectively). All it does is sit there in the wait state. Pmake.old works fine. 1027. Date: Tue, 06 Mar 90 17:40:45 PST From: Fred Douglis <douglis> Subject: assault rpc servers stuck busy i rebooted piquante. first assault hung piquante's broadcast, and then it hung up on all its prefix requests, so now it doesn't respond to rpcs by piquante at all. perhaps if we leave things in this state for the time being we can look at it during this evening's debugging session? 1028. Date: Tue, 06 Mar 90 22:52:39 PST From: Fred Douglis <douglis> Subject: coprocessor unusable on ds3100 yet another bug that piquante hit. 
this was while it was running a routine that was cpu-intensive but didn't use the fpu at all, so the only floating point stuff would be migd, etc. i had disabled migrations as well, so that wasn't it. 1029. Date: Wed, 7 Mar 90 08:47:36 PST From: brent (Brent Welch) Subject: Recovery bugs, SIGPIPE Loop We learned two main things last night while beating on the system. First, the recent changes I made to the recov module to prevent a race cause a deadlock instead. It is possible in kernels 1.059 and 1.060 for multiple Proc_ServerProcs to go through the recovery protocol at the same time. Using these processes up leads to deadlocks because other background processing (reaping dead processes, etc.) doesn't get done. Furthermore, eventually the Proc_Server processes get stuck on a locked process table entry - one that is locked waiting for something to be done by a Proc_ServerProc (page out?). I remember this deadlock from before. I'm pretty sure I know how to code the recovery module so it doesn't miss reboot events and also doesn't have more than one process (Proc_ServerProc) doing recovery. This bug also explains the recovery loop that was experienced last week. The other thing that Mendel found was a recursive SIGPIPE handler. Apparently telnet is sometimes hooked to a pipe. When the pipe gets closed the telnet process gets a SIGPIPE. The handler for SIGPIPE writes to stderr, which in this case was the same closed pipe. Voila, recursive invocations of the SIGPIPE signal handler. I'm not sure the best way to handle this. Perhaps only one signal per pipe? Earlier in the day Mendel figured out why Allspice occasionally hangs things for long periods of time. The root of the cause is the push of the file descriptor to disk during a delete. This can queue up behind many blocks, plus the condition variable that is waited on is for the whole dirty list, not just the block in question. Thus the descriptor write can block a long time.
The problem is compounded because the directory is locked during the delete, so other naming operations that pass through the directory are also blocked. One fix is to eliminate the synchronous push of the descriptor. This still makes me nervous, in spite of the block-copying support in fscheck. Other improvements include fixing the wait so it terminates when the descriptor block (not all dirty blocks) is written out. Finally, the directory may not need to be locked after the name has been removed. BUGS STILL AT LARGE. The file servers never crashed, which is good and bad. The bad news is that we could not duplicate the memory trashing problem that had been plaguing Allspice. We probably screwed up by not using the exact same kernel and the exact same dump/make dist combination that caused the other problems. Also, Allspice had a corrupted dirty list on Monday, and we don't know how to repeat that. I'll spend a few minutes looking at the code. The DMA bus error still exists, and it occurred once last night when we rebooted Allspice. I've always seen it occur on a descriptor read (this is a multiple block read, right?) - it doesn't seem to cause damage, although it could cause damaged files to be skipped over. Finally, a couple of the crashes recently have been due to bad RPC packets. The parameter block is just junk, and that eventually causes something to get an address error. Last fall I fixed some retransmission bugs in the RPC system that were causing this kind of thing. There may well be some more bugs like this. It is probably a good idea to make the RPC stubs more robust anyway, so that in the future we get error messages instead of crashes. ASSIGNMENTS: I'll fix the recov module. Perhaps Mendel can look into the delete-block-wait-directory-locking problem. Either Mendel or myself (perhaps the two of us) should also look at the code that uses the dirty list. (Remember we found that the dirty list header referenced a file that had NIL dirty list pointers.)
Someone should think about the SIGPIPE handler problem. I'm also a good candidate to make the RPC stubs more robust to bad parameters. We should all think about the memory trashing bug. Our only clue is that a hash table bucket was partially overwritten with an array of <pointer, pointer, small int> structs. 1030. Date: Wed, 7 Mar 90 09:42:07 PST From: brent (Brent Welch) Subject: Mint and Allspice glitches Mint ran out of memory this morning. The malloc was for 8 bytes, so it really did run out for some semi-legitimate reason. One clue I found was that the bin for objects of size 24 bytes (this includes 4 or 8 administrative bytes) was completely used. This often indicates a core-leak for objects of that size. There used to be a VM data structure of this size with a leak, but I thought it was plugged. (I can't remember what it was - can anyone else?) Allspice got the infamous "Level 15 interrupt" and entered the debugger. I ended up just continuing it because the kernel image on rosemary was truncated. I've fixed that. We should fix the interrupt handler for level 15 interrupts. Finally, I think the global migration daemon is dead, or there was an election conflict. I think that Allspice started up a global daemon, but it used to be running on host #17 (murder). I got this clue from a PdevControlReopen conflict, #17 lost to #14. 1031. Date: Wed, 7 Mar 90 10:14:43 PST From: mendel (Mendel Rosenblum) Subject: Re: Mint and Allspice glitches
>Allspice got the infamous "Level 15 interrupt" and entered the
>debugger. I ended up just continuing it because the kernel image
>on rosemary was truncated. I've fixed that. We should fix the
>interrupt handler for level 15 interrupts.
The level 15 interrupt is used on the 4/280 to signal a problem in the memory system. The two major causes are inconsistencies between the virtually addressed cache and the page tables, and memory board problems. There is no way of telling from the panic() message which one happened.
Currently, the memory boards are configured to report correctable ECC errors as level 15 interrupts. It could be allspice just hit a correctable ECC error. With 128 megabytes of memory using 1 megabyte chips correctable errors are a real possibility. Cache related errors are normally caused by Sprite not flushing the cache before invalidating or changing a mapping. Currently, the level 15 interrupt handler trashes some registers so continuing the kernel may or may not work depending on which registers are alive at the time of the trap. Action items for this problem: a) Figure out why allspice is getting these errors. Patch trap handlers to report causes of problem and not trash registers. b) Patch sun4 to handle correctable ECC errors. Simple: Configure memory boards to not report correctable errors. Fancy: Catch, log, and continue correctable errors. 1032. Date: Wed, 07 Mar 90 11:01:37 PST From: Fred Douglis <douglis> Subject: procID.c running on allspice i logged into allspice because its load was over 1. i found an lpd in a loop, which died when i kill -DEBUG'ed it, a sendmail in the debugger, which didn't make much sense and which had source more recent than executable, and this: nobody a0e54 0.0 0.0 96 56 WAIT 0:00 procID.c the parent of this process is inetd. there is no entry for procID.c in inetd.conf, implying that the kernel data structure for the argument string got trashed. 1033. Date: Wed, 7 Mar 90 12:58:04 PST From: mendel (Mendel Rosenblum) Subject: fsync() broken Fsync() on a large remote file doesn't push the file to disk. The file is written to the server's cache but not to disk. It appears that the file needs to be large enough so that it doesn't fit in the local cache. 1034. Date: Thu, 8 Mar 90 08:31:28 PST From: brent (Brent Welch) Subject: migration glitch on larceny I left a kernel make running last night, and it was hung up when I got in this morning. Sage had processes on burble and larceny. Burble was wedged enough that I couldn't log in.
I ran the kernel debugger on burble, but the interesting processes had goofy stacks, something like:
#0 panic (__builtin_va_alist=-166539552) (sysPrintf.c line 209)
#1 0xf611fea0 in sched_OnDeck ()
ERROR: invalid read address 0x8
During the debug session my pmake got one "Error code 16" and my syslog reported a timeout on a <send signal> and <rmt notify>. After that Sage didn't think it had processes on burble. Currently there are still migrated processes on larceny that were spawned by sage. Larceny seems to be in better shape; I'm logged in now and sending this mail. I'll leave my pmake hung here on sage, and why don't you take a look at larceny when you get in. I experienced a similar thing yesterday when I deliberately partitioned sage while it was the master of a pmake. It had 6 migrated compiles, and they all hung after the partition. The migration system did get a reboot callback (the file system did), so there may need to be some improvement there. I was able to kill off the processes with "kill -KILL", but I seem to remember that ^C and ^Z didn't have any effect. 1035. Date: Thu, 08 Mar 90 10:32:05 PST From: Fred Douglis <douglis> Subject: out of processes (or stacks) wedged system I ran a script that generated a lot of processes, and then my machine hung. since kvetching was at the time the migd master, it was getting lots of RPCs and those started timing out. I think we've seen this before. I kind of think this sort of 'sure thing' is a better bug to chase after right now than the ds3100 register tester, so i'm going to try to put in some controls to restrict user processes relative to total processes. i'm also tempted to put in a count of the number of touched pages, if that isn't there already, and let processes have large sparse address spaces but do something with them if they start touching too many pages and thrashing the swap area. comments? 1036.
Date: Thu, 08 Mar 90 11:44:31 PST From: Fred Douglis <douglis> Subject: pdev use counts not tracked if a pdev is deleted, its inode is deleted even if there's a master sitting around using the pdev. this is because when line 1847 of fsLocalLookup.c is hit, curHandlePtr->use.ref == 0. this breaks migd because it can result in many instances of the migration daemon sitting around. as a temporary fix, i think i can use a separate lock file, but that's a kludge. 1037. Date: Thu, 8 Mar 90 12:12:07 PST From: douglis (Fred Douglis) Subject: recovery killed kvetching right after recovery with anise, my machine died with a TLB miss in kernel. this has happened at least twice before in the past week or two, and in each case, the stack was trashed. 1038. Date: Fri, 09 Mar 90 11:32:43 PST From: Fred Douglis <douglis> Subject: sun4 debugger bug performing "pid <pid>" where <pid> doesn't have a context causes the kernel to panic, and also there's apparently no way to backtrace the kernel stack of a process that doesn't have a context. 1039. Date: Fri, 9 Mar 90 14:21:01 PST From: brent (Brent Welch) Subject: missing ' in alias kills csh At least on the ds3100, csh goes into a horrible infinite loop that expands its address space rapidly. This is caused by an alias that is missing a single quote. This line from culler's .cshrc repeatedly causes the problem: alias xidraw 'rsh dill "~douglis/cmds.ds3100/idraw -d cardamom:0" Simply sourcing a file with this line causes csh to go off the deep end. Right now John H. is debugging hijack to see what this process is doing. Off hand he suspects a longjump problem. 1040. Date: Sat, 10 Mar 90 13:01:20 PST From: tve (Thorsten von Eicken) Subject: RPC to organo hung, then ok These days again, whenever I type msgs, I have to wait 10 sec, get a message RPC to oregano is hung, then wait 20 sec, then get RPC ok, and then msgs works. Is that normal? 1041. 
Date: Sat, 10 Mar 90 14:14:49 PST From: Fred Douglis <douglis> Subject: piquante, one more time yet another new (?) error for piquante: VmMach_PageValidate: Kern TLB entry found. it was running my register tester at the time, but it died on a migrated cc process. i thought perhaps the error was actually continuable, but when i tried continuing it from the debugger it just locked up completely and wouldn't reenter the debugger. i wonder, are the errors we see on piquante always related to the coprocessor? if the cpu was swapped, was the coprocessor as well? 1042. Date: Sat, 10 Mar 90 15:23:17 PST From: Fred Douglis <douglis> Subject: spritemon won't run, or link, on sun3 brent reported that he couldn't install spritemon for the sun3 because it would get a segmentation fault. i found that i could link it with the installed "libXaw_g.a" but not libXaw.a, but i couldn't debug it with just libXaw.a because it would die elsewhere. strangely enough, it would run fine the first time and die any subsequent times. i finally wound up trying to link it with all the X debug libraries, and at this point ld just returns an error status of 1 without any error messages at all. to repeat (on a sun4 or sun3): % cd /X11R3/src/cmds/spritemon % pmake sun3 the load broke when i started using libX11_g.a. i wonder whether the loader just can't handle so many large libraries or something?? by the way, thorsten, i moved the libraries to /tmp and deleted the library source directories (leaving the compressed tar files). we may need to restore the sources them once i can debug spritemon again, but in the meantime you have your space back. 1043. Date: Sat, 10 Mar 90 16:08:46 PST From: Fred Douglis <douglis> Subject: vm bug? in addition to the problem brent just mentioned, i've noticed for a couple of days that compiling on sun3s would sometimes get into a state where pmake would just sit forever. debugging it showed that malloc was doing a Sync_SlowLock on its monitor lock. 
seems like we should have some check in LOCK_MONITOR for processes that aren't sharing memory, or something, but in any case, there's no reason for pmake to deadlock on the monitor lock. i can't reproduce the bug in the debugger, either. 1044. Date: Sat, 10 Mar 90 17:59:03 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: ds3100 libc.a I tried to install a new libc.a by installing the etc subdirectory, then doing an "installquick" in /sprite/src/lib/c. The resulting directory had lots of stuff missing. Right now the uninstalled libc.a is significantly smaller than its predecessor. Is someone working in libc? 1045. Date: Sun, 11 Mar 90 14:22:13 PST From: mendel (Mendel Rosenblum) Subject: serial line problems on sun4c When the serial line falls out of the laserwriter in 477 the sparcstation driving it hangs until you type l1-A and continue it. I believe that tve reported a similar bug trying to output to a terminal. 1046. Date: Mon, 12 Mar 90 13:28:13 PST From: Fred Douglis <douglis> Subject: ds3100 lpd is old any reason why the ds3100 lpd hasn't been reinstalled since last july? it tried to invoke /sprite/cmds.sun3/pr. looks like the source has this fixed. 1047. Date: Mon, 12 Mar 90 15:41:06 PST From: Fred Douglis <douglis> Subject: recovery failed msg fenugreek has on its console: oregano(38) - recovering handles oregano(38) Recovery failed: <4001b> which is FS_FILE_REMOVED. is this normal? should the failure to recover a particular file keep recovery from proceeding normally (and printing "nnn handles nnn failed attempts")? there was no subsequent attempt to recover with oregano, but it's talking to oregano just fine now so it must think it recovered. 1048. Date: Mon, 12 Mar 90 17:34:54 PST From: ouster (John Ousterhout) Subject: Hung pmake After this afternoon's Oregano crash a Pmake was left hanging on Piracy.
Here's the ps output:
piracy: ps
PID    STATE  TIME  COMMAND
2093e  EXIT   0:00  sh -ev
90921  WAIT   0:02  pmake install debug
e0927  EXIT   0:00  sh -ev
20939  EXIT   0:00  sh -ev
92a    EXIT   0:00  sh -ev
80943  EXIT   0:00  sh -ev
70948  EXIT   0:00  sh -ev
f0924  EXIT   0:00  sh -ev
7093a  EXIT   0:00  sh -ev
9091b  EXIT   0:00  sh -ev
50928  RWAIT  0:00  -csh
60911  WAIT   0:00  -csh
e092b  WAIT   0:00  -csh
3094b  EXIT   0:00  sh -ev
d0916  EXIT   0:00  cc -O -Dds3100 -Dsprite -Uultrix -I. -Ids3100.md ...
Fred, do you care to take a look or should I just try to kill the processes? Control-C didn't seem to have any effect. 1049. Date: Mon, 12 Mar 90 18:59:12 PST From: dedood (Paul de Dood) Subject: X Once I kill my X windows, I am unable to re-start X, unless I boot. The machine is burble. It may have something to do with internet connections I have open to other machines off campus when X is exited. 1050. Date: Mon, 12 Mar 90 21:33:25 PST From: Fred Douglis <douglis> Subject: recovery/migration bug the mail i sent before addresses the situation where a file server crashes and client migrations fail as a result. it doesn't address a situation where a network gets partitioned, and a machine with exported processes doesn't get notifications that the processes have exited from the remote machines. so, it thinks the processes are still running elsewhere. there are a couple of potential ways to get around this. one is to have the home machine periodically ping the remote machine to make sure it's still running the process. another is to have a mechanism in the recovery system to detect partitions and indicate when they occur and go away. when i asked before about whether the host would get a reboot event, and was told yes, i think there was some confusion. i think the reboot event happens only if an rpc was in progress to that host during the time of the partition, and for migrations that isn't normally the case.
despite the fact that i partitioned 477 from spurnet for long enough that it should have tried to ping all the hosts i had registered recov interest in, i never saw RPC timeout messages about hosts other than file servers. 1051. Date: Tue, 13 Mar 90 10:20:41 PST From: Fred Douglis <douglis> Subject: xgoned xgoned doesn't go away when the server shuts down. 1052. Date: Tue, 13 Mar 90 12:37:45 PST From: pmchen@basil.Berkeley.EDU (Peter M. Chen) Subject: kernel 1.060 on ds3100 (new) I've had a hard time booting "new" on the ds3100's. Mustard has crashed twice on reboot now. Both times it happened after the IPserver and inetd started. Fred repeated this on kvetching. Perhaps the file rot extended to the kernel image? 1053. Date: Tue, 13 Mar 90 13:13:49 PST From: sequent!fubar@uunet.UU.NET Subject: Minor nits in attcmds/ex/ex_io.c The switch in ex_io.c (around line 422) to check for edit attempts of an executable could be a bit more portable if it were done as shown below. The Symmetry lacks an NMAGIC, but has an SMAGIC and XMAGIC format. Presumably everybody's got an OMAGIC, but it could be #ifdef'ed as well.
    switch ((int)head.a_magic) {
    case 0405:      /* data overlay on exec */
    case OMAGIC:    /* unshared */
#ifdef NMAGIC
    case NMAGIC:    /* shared text */
#endif
    case 0411:      /* separate I/D */
#ifdef ZMAGIC
    case ZMAGIC:    /* VM/Unix demand paged */
#endif
    case 0430:      /* PDP-11 Overlay shared */
    case 0431:      /* PDP-11 Overlay sep I/D */
#ifdef SMAGIC
    case SMAGIC:    /* Symmetry standalone executable */
#endif
#ifdef XMAGIC
    case XMAGIC:    /* Dynix invalid at 0 executable */
#endif
        error(" Executable");
1054. Date: Tue, 13 Mar 90 14:01:29 PST From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: Maybe this will work. The first two times I tried to send this, send-mail on the sun3 went into the debugger with a useless stack trace. Things seem to be bad on pride (ds3100) after mint's crash. rn died on me and now gives segmentation violations if I try to run it.
I tried to investigate, but when I did a pushd, my window died with: Application terminated with status 3. A pmake I did died with: Warning: remote migd operation timed out. In my syslog I got: Warning: VmFileServerRead: Error 5 from Fs_Read or Fs_PageRead Bad user TLB fault in process 10617: pc=409570, addr=409570 Warning: Proc_RpcRemoteCall: invalid pid: c0628. Warning: Proc_RpcRemoteCall: invalid pid: 60635. Fs_PageRead: Read failed <5> Warning: VmFileServerRead: Error 5 from Fs_Read or Fs_PageRead. Bad user TLB fault in process e0628: pc=400170 addr=400170 I'll investigate some of these problems, but I wanted to send out the mail before my window system crashes or something. 1055. Date: Tue, 13 Mar 90 14:53:45 PST From: brent (Brent Welch) Subject: mint crash Mint crashed (before the recovery storm) in a strange way. Inside the Fsconsist_IOClientClose procedure a stack variable (flags) had the value of 1. However, in all the procedures above Fsconsist_IOClientClose this variable had an altogether different value (the correct value). This implies, perhaps, that the register holding this variable was corrupted. brent 1056. Date: Tue, 13 Mar 90 15:21:22 PST From: Fred Douglis <douglis> Subject: Re: kernel 1.060 on ds3100 (new) it was piquante that crashed, actually. 
it died right after ipServer was started, with a backtrace as follows:
0 TLBHashInsert(pid = 0, page = 2148274148, lowReg = 6029824, hiReg = 268453952) ["ds3100.md/vmPmax.c":1735, 0x800bf864]
1 .block572 ["ds3100.md/vmPmax.c":1188, 0x800beffc]
2 VmMach_PageValidate(virtAddrPtr = 0xc086bf54, pte = 3252684224) ["ds3100.md/vmPmax.c":1188, 0x800beffc]
3 VmPageValidateInt(virtAddrPtr = 0x80155360, ptePtr = 0x800c5490) ["vmPage.c":649, 0x800c3fbc]
4 PreparePage(virtAddrPtr = 0xc086bf54, protFault = 0, curPTEPtr = 0xb0000004) ["vmPage.c":1692, 0x800c56b0]
5 .block585 ["vmPage.c":1491, 0x800c5074]
6 Vm_PageIn(virtAddr = 0x10004b48, protFault = 0) ["vmPage.c":1491, 0x800c5074]
7 .block576 ["ds3100.md/vmPmax.c":1491, 0x800bf598]
8 VmMach_TLBFault(virtAddr = 0x8017e7c4 = "\210O^O\200\344\337^W\200") ["ds3100.md/vmPmax.c":1491, 0x800bf598]
9 MachUserExceptionHandler(statusReg = 64524, causeReg = 805306376, badVaddr = 0x10004b48, pc = 0x422668) ["ds3100.md/machCode.c":866, 0x80034330]
10 Mach_UserGenException(0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff) ["ds3100.md/machAsm.s":775, 0x80032e7c]
11 Compat_MapCode(status = [bad address (0xc086c028)]) [0x422664]
the kernel i built last night runs just fine. also, when i continued piquante in order to reboot my kernel, it ran without a hitch! strange. btw, it's not bitrot -- i copied the unstripped kernel over and stripped it, and it is identical to the copy we are booting from. 1057. Date: Tue, 13 Mar 90 19:10:53 PST From: Fred Douglis <douglis> Subject: treason has bad segment cpp goes into the debugger consistently on treason. i disabled migration onto it so that things wouldn't get unexpected errors. i'm leaving shortly so i can't debug treason right now, but i'll try to later if no one beats me to it. 1058. Date: Tue, 13 Mar 90 21:31:10 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: PCB table full!! I just ran a script that did lots of stuff in the background.
I figured the worst that could happen is that I would run out of processes. Wrong! If there are no more processes left and an Rpc_Server tries to fork you end up in the debugger. There must be a more elegant solution. Perhaps this doesn't need to be fixed right away, but it certainly does if and when we make Sprite more robust. 1059. Date: Wed, 14 Mar 90 08:33:33 PST From: bmiller (Bob Miller) Subject: 'entering debugger' msgs I don't know if anyone else has been having these problems, but I've gotten the 'entering debugger' message on a regular basis lately. Is this related to software or hardware? Here are some of the messages (I haven't kept track of all of them - there have been 5 or 6 of them over the past 3 days and more last week)... entering debugger with a TLB LD miss exception at PC 0x800b24b4 entering debugger with a TLB store address error exception at PC 0x8008621c " " " " " " " " " " " " 1060. Date: Wed, 14 Mar 90 13:59:27 PST From: brent (Brent Welch) Subject: inetd looping I caught the inetd on allspice in an infinite loop. I think its problem concerns my recent change to select. Its select call was returning that a particular socket was readable. However, when it did a recvfrom() on the socket it got an ESTALE return code. It doesn't close the socket in response to this error (I could patch it to do this) and so the next select again returns that this stream is readable. This particular socket is used with requests for the time, and it apparently keeps it open. The main problem is that select just doesn't work well with bad I/O streams... 1061. Date: Wed, 14 Mar 90 16:44:14 PST From: elm (ethan miller) Subject: rsh to rosemary It doesn't seem to work. I don't know whether this is a Sprite problem or a rosemary problem, though. I get a message like this when I try it: rosemary.Berkeley.EDU: address already in use I'm trying to run an xterm remotely on rosemary (since there is no xterm on Sprite), but I get the same message when I try a remote ls.
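The inetd loop in message 1060 above comes from leaving a failed stream in the select set, so every later select reports it readable again. A minimal sketch of the kind of patch Brent mentions (closing the socket on a hard error), assuming plain POSIX sockets rather than the actual Sprite inetd source; the name service_readable and its return convention are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not inetd's code: service one descriptor that
 * select() reported as readable.  Returns the byte count on success,
 * 0 if the error was transient, and -1 after retiring the descriptor.
 */
ssize_t
service_readable(int fd, fd_set *watched, char *buf, size_t len)
{
    ssize_t n;

    assert(fd >= 0 && fd < FD_SETSIZE);

    n = recvfrom(fd, buf, len, 0, NULL, NULL);
    if (n >= 0) {
        return n;
    }
    if (errno == EINTR || errno == EAGAIN) {
        return 0;       /* transient: keep watching this descriptor */
    }
    /*
     * Hard error (ESTALE in the report above): stop selecting on the
     * stream, otherwise the next select() returns it as readable again
     * and the caller spins.
     */
    FD_CLR(fd, watched);
    close(fd);
    return -1;
}
```

A caller would pass the same fd_set it copies into each select call, so a retired descriptor never shows up as readable again. Whether inetd should also reopen the time service socket afterwards is a separate question that the message leaves open.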
1062. Date: Thu, 15 Mar 90 10:31:42 PST From: mendel (Mendel Rosenblum) Subject: ds3100 as goes infinite on / 0 If you try to compile the following program on ds3100 Sprite the assembler goes into an infinite loop:
main()
{
    int s;
    return s / 0;
}
The "/ 0" produces the message "as1: Warning: t2.c, line 1: Division by zero" and loops infinitely. 1063. Date: Fri, 16 Mar 90 11:44:17 PST From: culler (David Culler) Subject: Update on Allegro lisp The patched Allegro lisp mostly runs on sprite and I have been able to adjust the pathnames so it can find library files and the like. It handles ascii files correctly, but encounters an error in closing binary files. I have not located a source for this, although I can isolate and disassemble the particular functions. Aside from the error in closing, binary files seem to be read correctly. 1064. Date: Fri, 16 Mar 90 14:29:17 PST From: Fred Douglis <douglis> Subject: fmt.h/machTypeMips.c i keep seeing complaints of the form: /sprite/lib/include/fmt.h:56: warning: FMT_MY_FORMAT redefined compiling (on a sun) machTypeMips.c. that's because ds3100 is defined too. 1065. Date: Fri, 16 Mar 90 15:44:13 PST From: Fred Douglis <douglis> Subject: sun4c migration bug? i've been seeing some random error code 16's on sun4c's today. i can't look into it too carefully just yet, but wanted to make sure people were alert to the possibility (and to repeatable test cases). this was particularly nasty because i was doing an install of proc, and after removing and reinstalling the sources in Installed/proc 3 times, it removed them and failed to install them. (i'd be much happier if only files that didn't match current sources were removed, since this would also avoid any synchronization problems due to installing multiple machine types at once.) Fred 1066.
Date: Sat, 17 Mar 90 18:35:05 PST From: tve (Thorsten von Eicken) Subject: MIPS compiler bug I have two octtools programs which bomb if I compile them with -O on a ds3100, but which run fine on other machines and without -O on ds3100s. For one of them I checked the assembly output with -O and there obviously was a bug. In two identical loops (identical source, char by char, one loop after the other) the second one had two instructions swapped which made it loop indefinitely. I haven't checked the output for the second program. Do we have an old compiler binary? The same programs seem to compile fine on DECstations over in cory (I haven't checked which compiler they use). Thorsten NB: this is just for information... 1067. Date: Sun, 18 Mar 90 16:26:36 PST From: tve (Thorsten von Eicken) Subject: ds3100 goes into debugger reliably running dbx I have a program in ~casotto/dmtest which opens an X connection to a decstation over in cory. The XOpenDisplay call hangs in the select system call for no obvious reason. I do a "kill -DEBUG" while it's hanging to have a look at it with "dbx -attach ... ./dmtest", type "run" and "where" to verify it's in select, and then type "c", and voila: gluttony is no more. I don't know what the console says, but John can probably tell you :-), I tried it twice and both times it died. To reproduce on another ds3100, it requires an xhost on the remote decstation in cory. 2 notes: the program is an ultrix binary and the program works perfectly fine when trying to open a connection to another sprite machine. Mysterious... 1068. Date: Sun, 18 Mar 90 16:36:59 PST From: Fred Douglis <douglis> Subject: monitor blackout this was sent just to me, but i don't think it's clearly something relating only to my own kernel: >>>>> On Sun, 18 Mar 90 15:54:07 PST, eklee@sprite.Berkeley.EDU (Edward K.
Lee) said: Ed> I was running fred 1.115 when my monitor blanked and I could not get it Ed> to unblank (hitting keys would cause various parts of the screen to flash Ed> briefly but it would not stay unblanked). Ed - what were you doing at the time it blanked? were you 30 seconds idle (meaning something could have migrated onto you) or were the only active processes your own? were you running your simulator? also, what did you do at this point? did you happen to try the l1 key that is supposed to enable the display? Fred 1069. Date: Sun, 18 Mar 90 19:03:38 PST From: eklee (Edward K. Lee) Subject: sassafras hung with swap error When I ran 'nm -g /sprite/src/kernel/eklee/sun4 > t' on sassafras, it hung with messages of the type: Warning: VmOpenSwapFile: Could not open swap file /swap/29/51, reason 0x1 printed to its console. It didn't crash but would not respond to pings. I tried to repeat it without success. 1070. Date: Sun, 18 Mar 90 22:51:41 PST From: shirriff (Ken Shirriff) Subject: /swap1 confused % ls /swap1 /swap1/28 not found /swap1/29 not found /swap1/3 not found /swap1/30 not found 1/ 22/ 39/ 56/ 70/ ...etc... Why are these not found? Is the file system messed up? 1071. Date: Mon, 19 Mar 90 01:34:10 PST From: eklee (Edward K. Lee) Subject: select I/O error I was running pmake on sassafras when it hung (would not respond to pings). I'll leave it in this state for a while in case someone wants to look at it. Ed --- On the console were messages of the form: Fs_Dispatch select error: I/O error <28>Mar 19 00:51:00 inetd[71d0d]: select: I/O error <28> ... select: I/O error <28> ... select: I/O error <28> ... select: I/O error <28> ... select: I/O error <28> ... select: I/O error <28> ... select: I/O error <27>Mar 19 00:51:32 inetd[71d0d]: Exiting: Too many select errors ^---[sic] 1072. Date: Mon, 19 Mar 90 11:06:54 PST From: shirriff (Ken Shirriff) Subject: Re: monitor blackout The same sort of blanking just happened to me. 
The screen was blanked with the screensaver when I came in, so I hit a key. Part of the screen unblanked and then the whole screen blanked again. (It looked like the screen was unblank long enough to redraw about 20% of the screen and then went blank.) I hit a bunch more keys, and the screen would flash on and off. Finally it unblanked and stayed. It was like I had to fight with the screensaver for control. This was with ds3100 kernel 1.061. 1073. Date: Mon, 19 Mar 90 12:54:22 PST From: Fred Douglis <douglis> Subject: Re: pmake hangs on sun3 Pmake hangs on sun3. It works if invoked as "pmake -X". [this was on fenugreek, as reported by mendel.] it didn't work with -X for me either. i'm pretty sure this is the bogus VM segment bug we discussed at last week's meeting. pmake doesn't get very far at all because its first malloc blocks on the monitor lock. ("pmake -d m" should print stuff when it goes to stat files, and it doesn't print anything at all so it's not getting far enough to be doing migration or much of anything else.) i tried debugging it. first allspice hung left and right, and then just when i was debugging it using kgdb, the problem cleared up on its own. this could be related to it going through recovery with allspice; who knows? we'll just have to keep watching for machines (esp. sun3s, it seems) getting into this state. 1074. Date: Mon, 19 Mar 90 17:18:16 PST From: brent (Brent Welch) Subject: junky file descriptor Allspice was having trouble with some directories in /swap1 today. On a hunch I tried flushing its cache (fscmd -f) and voila, the problems cleared up. This implies that there was a bad cache block, one that didn't really have file descriptors in it. yuck. 1075. Date: Tue, 20 Mar 90 17:03:19 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: mx redraw bug If I use the "Forms" menu to copy a form into an mx window then the window is not redrawn correctly. 
The menu obscures the top of the window, but when the form is added the obscured area is scrolled downwards. It appears that mx then redraws the top of the window instead of the correct area. The result is a square white box where the menu used to be. To reproduce the bug run mx on a new file, then select the "local.mk" option of the "Forms" menu. 1076. Date: Tue, 20 Mar 90 17:48:14 PST From: Fred Douglis <douglis> Subject: piracy impossible bus error piracy was in the debugger. just for the hell of it, i ran kdbx on it, and i found that it reported a bus error in an assignment statement: Bus error [.block185:1093 ,0x8005f050] fs_Stats.blockCache.numFreeBlocks--; the assembler instruction at %pc was at the starred instruction in: [Fscache_FetchBlock:1088, 0x8005f04c] nop >*[Fscache_FetchBlock:1093, 0x8005f050] lui r24,0x8013 [Fscache_FetchBlock:1093, 0x8005f054] lw r24,9508(r24) seems pretty bizarre. sounds like a hardware error rather than a bad address. on the ds3100 front, we still don't have piquante running ultrix. the vaxenfixen have been working on it now for about a week. they started playing w/ dill last wednesday; did little on thursday or friday to try to get piquante going because i guess they lacked some sort of install tape; and finally started actively trying to boot piquante sometime yesterday or this morning. no luck so far. must be a great advertisement for sprite, how easily we can add a new diskless machine, huh? 1077. Date: Tue, 20 Mar 90 22:46:35 PST From: Fred Douglis <douglis> Subject: reboot doesn't close file, reclaim space? ken had a process on nutmeg in a tight loop writing data to a file in /user1. removing the file didn't free up the space since the file was open. nutmeg then rebooted, and the space still didn't free up. this seems odd. is it expected behavior? it looks like we may have to reboot allspice to get the file in lost+found where it can be deleted. since it's huge, and /user1 is full, this may happen tonight. 1078. 
Date: Wed, 21 Mar 90 11:38:16 PST From: Fred Douglis <douglis> Subject: non-idempotent pdev recovery after a crash, i don't expect pseudo-devices to recover the way normal files do. after a network partition, though, there's no reason whatsoever for pdevs to get nuked because of failed recovery. this is because there's really no reason to do recovery on them at all. i've had some trouble with my ethernet connection on kvetching today, and each time i lost my connection, my remote emacs and spritemon windows died. 1078. Date: Wed, 21 Mar 90 17:10:47 PST From: mendel (Mendel Rosenblum) Subject: cc1.sparc bug Cc1.sparc goes into the debugger when faced with the following program:
foo(collapse)
char collapse;
{
    char* bp;
    ((*bp = '%') != collapse);
}
This fragment is from the Xt library. 1079. Date: Thu, 22 Mar 90 18:43:20 PST From: Fred Douglis <douglis> Subject: rdist symlink bug fixed, but stat needs fix too ever notice how rdist wouldn't update symbolic links because it would complain 'file changed size'? well, i got fed up with seeing that (and fed up with writing for the moment :) so i looked into rdist. i found where it checks the length -- using both readlink and stat and getting different lengths even on the local host. readlink is fixed to return the unix-equivalent length of a symbolic link, but stat counts the null byte. would programs break if stat were changed to make the same check for name length? in the meantime, rdist fixes the count itself, and it works much better now. 1080. Date: Fri, 23 Mar 90 11:49:52 PST From: Fred Douglis <douglis> Subject: mint's arpd died.... ... so hosts that didn't already have entries for sprite hosts couldn't talk to us. i couldn't rlogin to mint, perhaps for that reason, but i was able to start arpd on fenugreek and things are much better now. 1081. Date: Sat, 24 Mar 90 10:05:06 PST From: root (The Sprite God) Subject: csh on Allspice There is a csh process on Allspice that is looping in Sig_SetHoldMask.
I've suspended the process so that a signal expert can take a look. PID 0x0e3e

1082. Date: Sun, 25 Mar 90 17:55:27 PST
From: mendel (Mendel Rosenblum)
Subject: tar goes into debugger.

Tar goes into the debugger on sun4 (and probably other machines) if its input stream doesn't exist. For example:

    jaywalk% notfound | tar xf -
    notfound: Command not found.
    Assertion failed: (head) line 50 of "list.c"
    Debug [5] 71216 d1223

1083. Date: Tue, 27 Mar 90 13:13:36 PST
From: Fred Douglis <douglis>
Subject: oregano ipServer & other processes vanished

oregano acted strangely this morning. i noticed that "rup" showed it was down, so I ran "f =ps@oregano" and that worked, showing no migd running. But an rsh onto oregano just hung, and then pinging oregano stopped working. Migrating onto oregano worked and I was able to find that the ipServer was totally gone. Since the "fixIPServer" script only checked for the ipServer in the debugger, it didn't restart it. I just changed the script to see if the process exists at all, and the mail at 13:10 from oregano (and fenugreek) is due to this change.

1084. Date: Thu, 29 Mar 90 11:22:25 PST
From: brent (Brent Welch)
Subject: distribution

I've just spent over an hour on the phone with the guy at DEC. We patched the gethostbyname() procedure in /sprite/src/lib/c/net/gethostnamadr.c to always fall back to the "/etc/hosts" file if the name server doesn't respond. We had to manually update /sprite/lib/ds3100.md/libc.a because a make in /sprite/src/lib/c failed. The disk sub-directory is out-of-date, probably because of the recent changes made there. He still has "diskUtils.h", for example (which doesn't compile), while our source tree has a "disk.h". Typing "make install" in the /sprite/src/daemons directory failed for a number of reasons. Mainly there are not ds3100.md directories everywhere, plus there are some missing man pages.
/boot/bootcmds was modified so that the

    if (%MACHINE == ds3100) then
        sethostname `hostname`
        sethostids
    endif

sequence was moved way up, before everything. The sethostname program has to be run before any socket() calls are made. The kernel-version of the compatibility library depends on a variable, machHostName, which is only set by this call. That's all I can think of. So, while we can perhaps make a kernel from the distribution, we cannot make the library or all the daemons and commands.

1085. Date: Thu, 29 Mar 90 11:57:04 PST
From: mgbaker (Mary Gray Baker)
Subject: killed vi freezes shell

I did a kill -KILL of a vi process, and my csh froze in that window.

1086. Date: Fri, 30 Mar 90 15:50:59 PST
From: pmchen (Peter M. Chen)
Subject: from my syslog

(this was from mustard--ds3100)

    PdevWrite: signal 14
    PdevWrite: signal 14

Any ideas what this means?

1087. Date: Fri, 30 Mar 90 16:38:00 PST
From: Fred Douglis <douglis>
Subject: mint not accepting rdates

finger & such worked, but rdate got refused. killing and restarting inetd did the trick. perhaps we should run inetd with debugging info enabled so we can see what it's thinking about when it goes off the deep end?

1088. Date: Fri, 30 Mar 90 17:21:51 PST
From: root (The Sprite God)
Subject: rawstat in debugger on nutmeg

Does anyone know why nutmeg had a whole pile of:

    root a036d 0.0 0.0 160  0 DEBUG 0:00 rawstat -all
    root 2036e 0.0 0.5  72 40 WAIT  0:00 sh -c /c/stats/RAW

processes, using up all the available processes?

1089. Date: Mon, 2 Apr 90 08:39:45 PDT
From: ouster (John Ousterhout)
Subject: More mice

It makes perfect sense that this would happen the day that Brent starts at PARC.... A bunch of file corruptions were detected by my checksum program last night, after a couple weeks without problems. The corrupted files are listed at the end of this message. When I examined the files, some of them didn't appear to be corrupted after all, but some definitely did.
Some of the corrupting material seems to be from the workshop position paper that Mary is preparing. I notice that Allspice was rebooted just before midnight last night... Mary, were you working on the report at the time of the reboot? Does anyone know anything about the circumstances of the reboot? Was it an ugly crash? Here's the checksum output: Checksum started at Mon Apr 2 04:35:18 PDT 1990 Running on allspice.Berkeley.EDU ./jhh/proj/user/lockstat.begin corrupted: id 11045 mtime 26140af9 old 627d2b47 new a31ef466 ./jhh/proj/user/sysstat.begin corrupted: id 11049 mtime 26140afd old ec68620c new a0f6a69c ./jhh/proj/user/lockstat.end corrupted: id 11050 mtime 26140955 old 837b32c4 new ee1e9809 ./ouster/tmp/a.out corrupted: id 556 mtime 26150fa7 old 0 new ce90cfef ./jhh/proj/user/sysstat.end corrupted: id 11051 mtime 26140959 old 5c2425a5 new 703fbe8e ./shirriff/.newsrc1 corrupted: id 44990 mtime 2615045e old 827d6365 new 19355015 ./shirriff/.plan corrupted: id 10163 mtime 2614ff8d old a7c84178 new 39feb394 7 errors found Checksum completed at Mon Apr 2 05:11:15 PDT 1990 1090. Date: Mon, 2 Apr 90 17:21:03 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: junk file bug I spoke briefly with Brent about the file inconsistency bug. When a file is closed the attributes, including size, are pushed back to the file server. If there are cache blocks to back up the file, then their (garbage) contents will be provided to the next open. The file size is used to determine how much of the cache blocks are garbage. In this case the size is incorrect. We aren't certain why there are cache blocks allocated to the file if the file was just created. There may be a bug here somewhere. Brent agreed that incrementing the version number if the consistency callback fails sounds like a good idea. He is concerned that the real bug is that we don't handle attributes correctly. Perhaps they shouldn't be pushed to the file server until all the blocks of the file have been. 1091. 
Date: Wed, 4 Apr 90 15:18:31 PDT
From: shirriff (Ken Shirriff)
Subject: ds3100 pc messed up in 1.061

Forgery crashed with an illegal instruction error. It was running 1.061 kernel. The stack trace and pc seem to be messed up:

    0 Net_EtherAddrToString.Net_EtherAddrToString(0x1, 0xffffffff, 0x80072a7c, 0xc0261808, 0x1000e2d8) [0x8e800070]
    1 Mach_TestAndSet(0x1, 0xffffffff, 0x80072a7c, 0xc0261808, 0x1000e2d8) ["ds3100.md/machAsm.s":1064, 0x80033100]
    2 Compat_MapCode(status = 0) [0x1001b]
    3 Compat_MapCode(status = -1667522559) [0x1001b]
    4 Compat_MapCode(status = 63) [0x1000e2d4]
    5 Compat_MapCode(status = 268493528) []

The pc is pointing into the text segment, not the code segment.

1092. Date: Wed, 04 Apr 90 17:25:17 PDT
From: Fred Douglis <douglis>
Subject: new account setup

i think something must be broken with the program that sets up new accounts, or not all fields are being filled in properly. when i fingered the two students [bsmith, tockey] i'm supposed to 'shepherd', to see if they're ever using sprite, they not only didn't have .project files (i mentioned this before but didn't hear anything from anyone) but they also didn't have mail set up to be forwarded anywhere. mail to sprite-users may collect in their mailboxes on sprite without being read in a timely fashion.

1093. Date: Thu, 5 Apr 90 00:03:02 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: allspice debugging report

I was unable to determine the cause of the dma bus errors. I rebooted allspice 4 times and got the bus errors on three of them. Each time the controller wasn't active. It did not have a scsi command in progress and the dma was off. I don't know what to make of this. Right now allspice is running my kernel. It is made of all installed modules except for dev. Please reboot it with the .new kernel if it crashes.

1094. Date: Fri, 06 Apr 90 09:27:00 PDT
From: rab (Robert A. Bruce)
Subject: allspice crash

Allspice crashed.
Fatal Error: Fscache_OkToScavange: NILL on dirty list (continuable) It kept trying to continue, but would just reenter the debugger with the same error over and over. When I tried to debug it, it kept popping out of the debugger. 1095. Date: Fri, 6 Apr 90 16:06:30 EDT From: douglis@piquante.berkeley.edu (Fred Douglis) Subject: recovery failure kvetching hung during recovery with allspice, when allspice was totally wedged up witha huge swap file from heresy. as everyone else was recoverying kvetching froze completely. an li-y dump showed "want recovery reboot callbacks failure srv-inprogress" or something like that. 1096. Date: Fri, 06 Apr 90 14:25:15 PDT From: rab (Robert A. Bruce) Subject: mint crash Mint crashed. It was running 1.061, I rebooted it with 1.062. Fatal Error: Fscache_OkToScavenge: FSCACHE_FILE_BEING_WRITTEN (continuable) #0 panic (_va_args=235020820) (sysPrintf.c line 209) #1 0xe0223a4 in FscacheBlockOkToScavenge (cacheInfoPtr=(struct Fscache_FileInfo *) 0xe3f8f90) (fsBlockCache.c line 3013) #2 0xe022fa0 in Fscache_OkToScavenge (cacheInfoPtr=(struct Fscache_FileInfo *) 0xe3f8f90) (fsCacheOps.c line 432) #3 0xe0284f2 in Fsio_FileScavenge (hdrPtr=(struct Fs_HandleHeader *) 0xe3f8f50) (fsFile.c line 791) #4 0xe039a00 in Fsutil_HandleInstall (fileIDPtr=(struct Fs_FileID *) 0xe8b7eb8, size=68, name=(char *) 0xe68a5e8 "passwd", hdrPtrPtr=(struct Fs_HandleHeader **) 0xe8b7eb4) (fsHandle.c line 310) #5 0xe02ab9a in Fsio_StreamCreate (serverID=32, clientID=66, ioHandlePtr=(struct Fs_HandleHeader *) 0xe424a80, useFlags=36865, name=(char *) 0xe68a5e8 "passwd") (fsStream.c line 108) #6 0xe027d40 in Fsio_FileNameOpen (handlePtr=(struct Fsio_FileIOHandle *) 0xe424a80, openArgsPtr=(struct Fs_OpenArgs *) 0xe3e3398, openResultsPtr=(struct Fs_OpenResults *) 0xe2d182c) (fsFile.c line 308) #7 0xe02bdbe in FslclOpen (prefixHandlePtr=(struct Fs_HandleHeader *) 0xe11f984, relativeName=(char *) 0xe3e3798 "etc/passwd", argsPtr=(char *) 0xe3e3398 "", resultsPtr=(char *) 
0xe2d182c "", newNameInfoPtrPtr=(struct Fs_RedirectInfo **) 0xe8b7f7c) (fsLocalDomain.c line 178) #8 0xe03757e in Fsrmt_RpcOpen (srvToken=(ClientData) 0xe3e2838, clientID=66, command=7, storagePtr=(struct Rpc_Storage *) 0xe8b7fc4) (fsSpriteDomain.c line 386) ---Type <return> to continue--- #9 0xe05cc58 in Rpc_Server () (rpcServer.c line 199) #10 0xe05fa48 in Sched_StartKernProc (func=(void (*)()) 0xe05ca28) (schedule.c line 944) (gdb) print *cacheInfoPtr %1 = {links = {prevPtr = 0xe0819b0, nextPtr = 0xe1498b0}, dirtyList = {prevPtr = 0xe3f8f98, nextPtr = 0xe3f8f98}, blockList = {prevPtr = 0xe3f8fa0, nextPtr = 0xe3f8fa0}, indList = {prevPtr = 0xe3f8fa8, nextPtr = 0xe3f8fa8}, lock = {inUse = -2147483648, waiting = 0, name = 0xe01fa2a "Fs:perFileCacheLock", holderPC = 0xe0819c0 "\016\b\031\300\016\b\031\300\016\b\031\310\016\b\031\310\016\r\336\374\016\016\320p", holderPCBPtr = 0xe30bcf8 "\016\t\264\270\016\t\264\270"}, flags = 0, version = 2, hdrPtr = 0xe3f8f50, blocksInCache = 0, blocksWritten = 0, numDirtyBlocks = 0, noDirtyBlocks = {waiting = 0}, lastTimeTried = 0, attr = {firstByte = 0, lastByte = 9215, accessTime = 639428963, modifyTime = 639428782, createTime = 555793808, userType = 5, permissions = 511, uid = 0, gid = 155}, ioProcsPtr = 0xe076de8} -bob 1097. Date: Fri, 06 Apr 90 14:57:08 PDT From: Fred Douglis <douglis> Subject: hosts still not invoking recovery automatically a whole bunch of hosts are listed as "down", all from when mint crashed. mint was running the global migd at the time. taking treason as an example, "f =ps@treason" got a response showing the host was up, but the ps never produced anything. then when i logged in, things went back to normal, including la showing treason as up again. a glance at the syslog showed that mint rebooted at 14:10 but recovery didn't start until i logged in at 14:48! i think the same held true of tyranny earlier when mendel logged into it to check it out. 1098. 
Date: Fri, 6 Apr 90 17:14:13 PDT
From: elm (ethan miller)
Subject: compiler bug dealing with 64-bit integers

There is a compiler bug dealing with multiplying and/or dividing 64-bit integers. There is an unexplained sign change (and perhaps more) when this is done. Sample code fragment follows:

    inline int64
    rtc_to_us (rtc_val)
        int64 rtc_val;
    {
        return ((rtc_val * (int64)5998) / (int64) 1000000);
    }

When I called this procedure with a number which is in the 100s of thousands, I get a negative result. Clearly, this isn't because of overflow, since 1000000 * 5998 should still fit into a 64-bit integer. When I converted the int64s to doubles and did the calculations, I got the correct results. The compiler in question is the sun4 compiler, which I have been running on my SparcStation (terrorism).

1099. Date: 6 Apr 90 12:04:11 PDT (Friday)
From: "Brent_B._Welch.PARC"@Xerox.COM
Subject: Re: allspice crash

    Fatal Error: Fscache_OkToScavange: NILL on dirty list (continuable)

This panic occurs when the background scavenging checks the dirty list and finds NIL pointers in it. The fact that it doesn't repair anything means that, indeed, it will panic every time it scavenges a handle. This means the panic is not continuable, obviously. NIL pointers result from a handle being removed and reused while it is still on the dirty list. A handle is removed in two places: by the scavenger and when a file is deleted. I had hoped that the addition of Fscache_Delete in 1.062 had fixed the problems. Apparently there is still a race in which a dirty file that is being deleted can get stuck back on the dirty list. There is a check in PutFileOnDirtyList against FSCACHE_FILE_GONE, which is set while a file is being deleted, but apparently this isn't good enough.

1100. Date: Sat, 7 Apr 90 21:13:37 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: lost print jobs

If you queue a job while the machine connected to the printer is down the job will be lost when the machine reboots.
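
A note on entry 1098 (the 64-bit integer report): the following is a minimal C sketch, not a diagnosis of the sun4 compiler. It shows that the arithmetic done fully in 64 bits gives a positive result for inputs in the hundreds of thousands, and that a negative result would appear if the intermediate product were kept in only 32 signed bits. That truncation is purely an assumption made here for illustration; the log does not say what the compiler actually did. The names rtc_to_us_truncated and the use of int64_t from <stdint.h> (in place of the report's int64 type) are hypothetical stand-ins.

```c
#include <stdint.h>

/* The computation from the report, carried out entirely in 64 bits.
 * The constants 5998 and 1000000 are taken from the original fragment. */
int64_t rtc_to_us(int64_t rtc_val)
{
    return (rtc_val * (int64_t)5998) / (int64_t)1000000;
}

/* Hypothetical failure mode (an assumption, not a confirmed cause):
 * the product is reduced to its low 32 bits and reinterpreted as a
 * signed 32-bit value before the division. */
int64_t rtc_to_us_truncated(int64_t rtc_val)
{
    /* Unsigned arithmetic so that wrap-around is well defined in C. */
    uint64_t product = (uint64_t)rtc_val * 5998u;
    uint32_t low = (uint32_t)product;          /* keep the low 32 bits */

    /* Reinterpret the low word as two's-complement signed. */
    int64_t as_signed = (low & 0x80000000u)
        ? (int64_t)low - INT64_C(4294967296)
        : (int64_t)low;

    return as_signed / 1000000;
}
```

With an input of 500000 the 64-bit version yields 2999, while the truncated version yields a negative number, because 500000 * 5998 exceeds the largest signed 32-bit value. Inputs below roughly 358000 give the same answer from both functions, which may be worth checking against when reproducing the bug.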

1101. Date: Sun, 8 Apr 90 11:08:06 PDT
From: ouster (John Ousterhout)
Subject: Another mouse

The checksum program found a corruption in /sprite/doc/ref.ancient/cmds/RCS/labeldisk,v on Friday morning. It looks like it inherited a piece of thorsten's mail file.

1102. Date: Sun, 8 Apr 90 13:46:36 PDT
From: ouster (John Ousterhout)
Subject: Uptime and loadavg

There was no "uptime" command in /sprite/cmds.sun4. I noticed that it's just a symbolic link to "loadavg" for other machines, so I created a corresponding link in /sprite/cmds.sun4. However, it seems to me that there should be an entry in the local.mk file in /sprite/src/cmds/loadavg to do this automatically. It looks like the loadavg program is still in a state of flux so I didn't go ahead and add the entry. Fred, can you take care of this?

1103. Date: Mon, 09 Apr 90 08:24:14 PDT
From: rab (Robert A. Bruce)
Subject: violence

When Bob Miller came in this morning violence had error messages scrolling across the screen several times a second. Each message said:

    cause=10002814 SR=25c00000 excPC=85046fe8 SP=8522b875 BVA=8f253271

He has had a lot of trouble with violence lately, so we replaced it with subversion. Violence is now in 608-4.

1104. Date: Mon, 9 Apr 90 19:57:07 PDT
From: mgbaker (Mary Gray Baker)
Subject: trashed mail

Is this explained by things we know already? I had a double-message that combined mendel's spring cleaning list and a message from Kathryn Crabtree. The first time I read these messages, they were separate. When I just read them, they were combined.

1105. Date: Tue, 10 Apr 90 15:29:43 PDT
From: Fred Douglis <douglis>
Subject: mint not serving rdate

"rdate mint" gets connection refused. wonder if this has something to do with the fact that many machines invoked recovery w/ mint at 4am, which is when all the machines run rdate against mint. last time this happened, killing & restarting inetd fixed the problem. this means that debugging inetd would probably be useful.
i'm going to pass, but if anyone else is interested, it's there. if no one indicates by tonight that they plan to look into it, i'll restart it late tonight.

1106. Date: Tue, 10 Apr 90 17:02:55 PDT
From: sequent!fubar@uunet.uu.net
Subject: Awk null pointer problems

In /sprite/src/attcmds/awk/awk.def, the macros "isstr()", "isfld()" and "isrec()" need to be changed; the provided versions will attempt to dereference null pointers (n.optr). A context diff is appended.

*** attcmds.old/awk/awk.def Mon Jul 11 09:57:08 1988
--- attcmds/awk/awk.def Tue Apr 10 16:49:20 1990
***************
*** 123,133 ****
  #define isbreak(n) (n.otype == OJUMP && n.osub == JBREAK)
  #define iscont(n) (n.otype == OJUMP && n.osub == JCONT)
  #define isnext(n) (n.otype == OJUMP && n.osub == JNEXT)
! #define isstr(n) (n.optr->tval & STR)
  #define istrue(n) (n.otype == OBOOL && n.osub == BTRUE)
  #define istemp(n) (n.otype == OCELL && n.osub == CTEMP)
! #define isfld(n) (!donefld && n.osub==CFLD && n.otype==OCELL && n.optr->nval==EMPTY)
! #define isrec(n) (donefld && n.osub==CFLD && n.otype==OCELL && n.optr->nval!=EMPTY)

  obj nullproc();
  obj relop();
--- 123,135 ----
  #define isbreak(n) (n.otype == OJUMP && n.osub == JBREAK)
  #define iscont(n) (n.otype == OJUMP && n.osub == JCONT)
  #define isnext(n) (n.otype == OJUMP && n.osub == JNEXT)
! #define isstr(n) (n.optr != NULL && n.optr->tval & STR)
  #define istrue(n) (n.otype == OBOOL && n.osub == BTRUE)
  #define istemp(n) (n.otype == OCELL && n.osub == CTEMP)
! #define isfld(n) (!donefld && n.osub==CFLD && n.otype==OCELL && \
!         n.optr != NULL && n.optr->nval==EMPTY)
! #define isrec(n) (donefld && n.osub==CFLD && n.otype==OCELL && \
!         n.optr != NULL && n.optr->nval!=EMPTY)

  obj nullproc();
  obj relop();

1107. Date: Wed, 11 Apr 90 18:05:13 PDT
From: pmchen (Peter M. Chen)
Subject: unable to rlogin

I was not able to rlogin from envy or coriander to any sprite machines. (This was 5 minutes ago). Now I'm able to. Weird.
Symptoms were: I tried "rsh mustard" and got hung (had to ~. to get out). I also tried rsh allspice, assault, and oregano with no better luck. I was able to ping and finger, though. When I tried rsh allspice -l root, I got the "Password:" prompt, but nothing after that. 1108. Date: Thu, 12 Apr 90 00:40:29 PDT From: lowery (Carlyn M. Lowery) Subject: Minor Comment on Some Documentation This is not a bug, but a misleading bit of documentation. In /sprite/src/kernel/rpc/rpcPacket.h, the following description is given: RPC_CLOSE only valid on type RPC_ACK messages. This means the client has successfully gotten its last reply and is ending the sequence of RPCs with the server. It should say: RPC_CLOSE only valid on type RPC_ACK messages. This means the server is requesting acknowledgement of its last reply so it can reassign the server process to an active client channel. When combined with RPC_SERVER, this means the client has successfully gotten its last reply. I reached this conclusion after examining the code. If I've misunderstood, please let me know. 1109. Date: Thu, 12 Apr 90 08:31:55 PDT From: ouster (John Ousterhout) Subject: Corrupted file The checksum program detected a corruption in the file /sprite/users/hilfingr/mp/enbsigfifo.o. Bob, can you restore this file from tape so Hilfinger never knows he was hit? -John- 1110. Date: Thu, 12 Apr 90 17:04:48 PDT From: rab (Robert A. Bruce) Subject: allspice crash Allspice crashed. It was running 1.063. 
Fatal Error: Fscache_DeleteFile failed #1 0xf603bd58 in Fscache_DeleteFile (cacheInfoPtr=(struct Fscache_FileInfo *) 0xf71dbb00) (fsCacheOps.c line 1372) #2 0xf6043024 in Fsio_FileTrunc (handlePtr=(struct Fsio_FileIOHandle *) 0xf71dbac0, size=0, flags=2) (fsFile.c line 1711) #3 0xf6049580 in Fslcl_DeleteFileDesc (handlePtr=(struct Fsio_FileIOHandle *) 0xf71dbac0) (fsLocalLookup.c line 1919) #4 0xf6041b74 in Fsio_FileCloseInt (handlePtr=(struct Fsio_FileIOHandle *) 0xf71dbac0, ref=0, write=0, exec=0, clientID=14, callback=1) (fsFile.c line 663) (gdb) print *cacheInfoPtr %1 = {links = {prevPtr = 0xf60cab70, nextPtr = 0xf6d3acf0}, dirtyList = {prevPtr= 0xf71dbb08, nextPtr = 0xf71dbb08}, blockList = {prevPtr = 0xf71dbb10, nextPtr= 0xf71dbb10}, indList = {prevPtr = 0xf71dbb18, nextPtr = 0xf71dbb18}, lock = {inUse = -16777216, waiting = 0, name = 0xf6035c52 "Fs:perFileCacheLock", holderPC = 0xf6092668 "\177\375\350\032\001", holderPCBPtr = 0xf69f67b8 "\366*\261\220\366*\261\220"}, flags = 2176, version = 1157, hdrPtr = 0xf71dbac0, blocksInCache= 0, blocksWritten = 0, numDirtyBlocks = 0, noDirtyBlocks = {waiting = 0}, lastTimeTried = 0, attr = {firstByte = -1, lastByte = -1, accessTime = 639957866, modifyTime = 639957881, createTime = 639949106, userType = 5, permissions = 416, uid = 0, gid = 0}, ioProcsPtr = 0xf60bd3a8} 1111. Date: Thu, 12 Apr 90 17:28:39 PDT From: Fred Douglis <douglis> Subject: bug in ftp fixed: /dev/tty problem thank you for reporting this bug. the problem with ftp was that ftp tries to open /dev/tty and uses stdin only if it doesn't get tty. normally tty doesn't exist, but ken managed to create it: [kvetching 17:16]/user2/douglis (8)% ls -l /dev/tty -rw-rw-r-- 1 shirriff 6 Apr 9 21:39 /dev/tty i removed it and now ftp works again. i now realize this has happened before. unfortunately, if it was in the log, it was in the old sprite log, so i didn't turn it up when i tried to see if we had a record of it. hopefully this will serve next time. 
a better fix, of course, would be to support /dev/tty! someday...

1112. Date: 12 Apr 90 17:39:24 PDT (Thursday)
From: "Brent_B._Welch.PARC"@Xerox.COM
Subject: Re: allspice crash

If Fscache_DeleteFile fails it means (I think - no Sprite access right now) that the file is on the dirty list. The flags value in Bob's message is 2176, which (I think...) is

    FSCACHE_FILE_GONE | FSCACHE_FILE_ON_DIRTY_LIST

It may be possible to recover from this case by simply removing the file from the dirty list. As I've said before, PutFileOnDirtyList checks against FSCACHE_FILE_GONE. Also, CacheFileInvalidate, which is called during a deletion and sets FSCACHE_FILE_GONE, is supposed to "do the right thing" with files on the dirty list.

Brent Welch

1113. Date: Thu, 12 Apr 90 18:15:33 PDT
From: mgbaker (Mary Gray Baker)
Subject: second allspice crash

The second crash on allspice was due to a panic in Fsutil_HandleReleaseHdr on a file that a client thought it had locked but that allspice didn't think it had locked. My guess is that this could somehow have been a result of the first crash on allspice. Other info: the reference count on the file was 2. Its name was "erwin". Crackle was opening it on allspice. The request header showed the server hint from crackle as being host number 19 (ponca), which has been out of commission for a while. But crackle has been up for a while, so if it hasn't been heavily used, it could still have a channel with an old server hint in it. How many days ago did ponca cease to exist? Crackle has been up for about 3 days.

1114. Date: Thu, 12 Apr 90 18:19:55 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: SPRITE_OS variable

If I login to a machine the environment variable SPRITE_OS is set to "yes". If I rlogin to a machine it is not set to anything.

1115. Date: Fri, 13 Apr 90 08:33:42 PDT
From: ouster (John Ousterhout)
Subject: More corruption

The file /sprite/users/david/sequent/patpg/sun3.md/md.mk was corrupted sometime yesterday.
The intruding data is as follows:

    t def Metrics begin /.notdef 0 def /space 500 def /ru 500 def /br 0 def /lt 416 def /lb 416 def /rt 416 def /rb 416 def /lk 416 def /rk 416 def /rc 416 def /lc 416 def /rf 416 def /lf 416 def /bv 416 def /o

This smells like Postscript to me. I can't help but think it's no coincidence that recent file corruptions have happened on the same days that Allspice crashed. Hmmm, I see that Mint rebooted yesterday too.

1116. Date: Fri, 13 Apr 90 10:21:43 PDT
From: tve (Thorsten von Eicken)
Subject: Re: second allspice crash

Some additional info: I was working in a dir which got deleted (rm -rf) from another client (burble).

1117. Date: Fri, 13 Apr 90 11:30:01 PDT
From: tve (Thorsten von Eicken)
Subject: chksum on /mic detects three corruptions!

    Return-Path: daemon
    Received: by sprite.Berkeley.EDU (5.59/1.29) id AA200284; Fri, 13 Apr 90 11:14:40 PDT
    Date: Fri, 13 Apr 90 11:14:40 PDT
    From: root (The Sprite God)
    Message-Id: <9004131814.AA200284@sprite.Berkeley.EDU>
    To: tve
    Subject: Checksum run for /mic

    Checksum started at Fri Apr 13 10:23:58 PDT 1990
    Running on allspice.Berkeley.EDU
    ./guest/casotto/projects/vov/develop/trace/vovlib.h corrupted: id 148410 mtime 260812b6 old d3f5becf new 16d0e096
    ./octtools/src/lib/ace/RCS/apGenerate.c,v corrupted: id 116693 mtime 25b121d5 old 7721ed2d new 21c65c8e
    ./bric/doc/.std.dvips.external corrupted: id 113955 mtime 26052883 old a4028bcb new e16d8748
    3 errors found
    Checksum completed at Fri Apr 13 11:14:36 PDT 1990

I haven't looked at them yet...

1118. Date: Fri, 13 Apr 90 11:36:23 PDT
From: Fred Douglis <douglis>
Subject: Re: chksum on /mic detects three corruptions!

out of curiosity, i looked at /hosts/allspice/rsd02c.fsc to see if it mentioned these files.
it didn't, but it mentioned a lot of other problems: rsd02c: Thu Apr 12 16:51:25 1990 rsd02c: "/mic" rsd02c: Indirect block 522120 of file 134326 contains garbage index 33686468 rsd02c: Found error in file descriptor bitmap rsd02c: File octtools/src/cmds/bolt/erwin/ds3100.md/dependencies.mk.bak~ references non-allocated descriptor 44806. File Deleted. rsd02c: Entry dependencies.mk.bak~ (4) now has nameLength 20, recordLength 428, fileNumber 0. rsd02c: File octtools/src/cmds/bolt/paul/tighten.c references non-allocated descriptor 60061. File Deleted. rsd02c: Entry tighten.c (6) now has nameLength 9, recordLength 20, fileNumber 0. rsd02c: Bad record length in directory. Directory entry deleted from octtools/src/cmds/bolt/paul/RCS/ rsd02c: . missing in directory 49632 octtools/src/cmds/bolt/paul/RCS/. Changed to a file. rsd02c: File . references non-allocated descriptor 49793. File Deleted. rsd02c: Entry . (0) now has nameLength 1, recordLength 12, fileNumber 0. rsd02c: . missing in directory 50317 . Changed to a file. rsd02c: File . references non-allocated descriptor 49792. File Deleted. rsd02c: Entry . (0) now has nameLength 1, recordLength 12, fileNumber 0. rsd02c: . missing in directory 105808 . Changed to a file. rsd02c: File . references non-allocated descriptor 889. File Deleted. rsd02c: Entry . (0) now has nameLength 1, recordLength 12, fileNumber 0. rsd02c: . missing in directory 113584 . Changed to a file. rsd02c: File . references non-allocated descriptor 888. File Deleted. rsd02c: Entry . (0) now has nameLength 1, recordLength 12, fileNumber 0. rsd02c: . missing in directory 113585 . Changed to a file. rsd02c: Bad record length in directory. Directory entry deleted from rsd02c: . missing in directory 113586 . Changed to a file. rsd02c: 26 unreferenced files rsd02c: 5 links counts corrected rsd02c: Found error in data block bitmap rsd02c: 42695 files, 540746 blocks in use, 84934 blocks free, 36392 fragments 1119. 
Date: Fri, 13 Apr 90 13:55:09 PDT
From: culler (David Culler)
Subject: problems with rdist

When I run rdist from mammoth to update files on sprite, I find some strange behavior. Rdist discovers the files as out-of-date on sprite and claims to update them. However, no change occurs on the sprite side. If I run rdist again on mammoth, it thinks everything on sprite is up to date.

1120. Date: Fri, 13 Apr 90 14:01:45 PDT
From: pmchen (Peter M. Chen)
Subject: unable to read files

For a short time, I was unable to read files in my home directory. The error message on my syslog was:

    Warning: VmFileServerRead: Error 5 from Fs_Read or Fs_PageRead
    Bad user TLB fault in process 32c41: pc=41bf84 addr=10003a80
    Fs_PageRead: Read failed <5>
    Warning: VmFileServerRead: Error 5 from Fs_Read or Fs_PageRead
    Bad user TLB fault in process 22c2b: pc=416f7c addr=10003d80
    Fs_PageRead: Read failed <5>
    Warning: VmFileServerRead: Error 5 from Fs_Read or Fs_PageRead
    Bad user TLB fault in process c2c0f: pc=404328 addr=100014a4

Windows are also dying on me left and right. If you want to look at the machine, CALL ME within 5 minutes, otherwise, I'll reboot the darn thing. This was on mustard, a ds3100.

1121. Date: 13 Apr 90 09:20:09 PDT (Friday)
From: "Brent_B._Welch.PARC"@Xerox.COM
Subject: Re: second allspice crash

Fsutil_HandleReleaseHdr et al. only deal with local locking. I'm not sure what is meant by "a client had locked". A handle may have been locked during a service call on Allspice itself, and then (apparently) unlocked an extra time. Second, the "server hint" is a port number, not a hostID. At any rate, an errant RPC by a client should not cause a locking error in Fsutil_Handle*. These routines are for local manipulation of handles, and each RPC should be self-consistent - no RPC leaves a handle locked after completion and therefore no RPC is responsible for unlocking a handle that was locked by a different RPC. In the past this unlocking error has occurred after a continued panic.
There is some error case that unlocks a handle early, and then when the panic is continued the handle gets unlocked a second time.

1122. Date: Fri, 13 Apr 90 15:26:49 PDT
From: tve (Thorsten von Eicken)
Subject: spritemon -v% must be wrong on sun4

on crackle which has 16 Megs of main memory:

    spritemon -vM shows >16M (almost 17)
    spritemon -v% shows about 50%
    spritemon -fM shows about 1Meg devoted to the fs cache

Mhh, someone must have added memory to my machine....

1123. Date: Fri, 13 Apr 90 17:16:47 PDT
From: elm (ethan miller)
Subject: sun4c kernel problem on joyride

The following message is printed in an infinite loop. This started to happen soon after allspice recovered today (Fri 13 Apr about 4PM).

    Entering debugger with an Interrupt Trap (16) exception at PC 0xf608e6a8
    Fatal Error: Deadlock!!!(Proc:serverMutex @ f6130f88)
    Holder PC: 0xf60760f4 Current PC: 0xf6076094
    Holder PCB @ 0xffffffff Current PCB @ 0xffffffff

After talking to Fred, I decided to reboot the machine.

1124. Date: Fri, 13 Apr 90 17:41:54 PDT
From: shirriff (Ken Shirriff)
Subject: Latest allspice crash

Allspice went into the debugger with:

    Fscache_OkToScavenge: FSCACHE_FILE_BEING_WRITTEN (continuable)

I couldn't figure out what was going on and I continued it.

1125. Date: Fri, 13 Apr 90 18:21:21 PDT
From: shirriff (Ken Shirriff)
Subject: Allspice crash.

Allspice crashed shortly after my previous message, presumably because the continue didn't work. It died with:

    Fscache_DeleteFile failed "151" blocks 0 flags 880
    Entering debugger
    Sprite is now detached from the debugger
    FsGetDirtyFile skipping deleted file <1,33785> "no name"
    [a bunch more FsGetDirtyFile messages ]
    MachHandleWeirdoInstruction: unaligned address trap in the kernel
    procPtr = f62ca100, pc=f605d500

The stack trace was

    MachReturnFromTrap()
    Sync_GetLock()
    GetDirtyFile()

It died in:

    List_Remove((List_Links *)cacheInfoPtr);

in GetDirtyFile.
I don't know what cacheInfoPtr is because I didn't realize that's where it died until it came back up and I could check the source. 1126. Date: Sat, 14 Apr 90 09:58:20 PDT From: ouster (John Ousterhout) Subject: More corruption The file /sprite/users/david/sequent/patpg/sun3.md/md.mk was corrupted yesterday. Its tail now consists of information from a Mx log file. Has there been a change of kernels in the last few days? I'm surprised that we're suddenly seeing a jump in file corruptions. 1127. Date: Sat, 14 Apr 90 14:25:57 PDT From: douglis (Fred Douglis) Subject: plumbing problem allspice hung up for a very long time, long enough to make us think it wasn't just a question of the disk getting backed up. however, it appears it cleared up on its own just as i was setting up to debug it, and i just didn't realize it in time. when i backtraced all the relevant processes, i found them all waiting on disk activity (Dev_BlockDeviceIOSync). i also managed to make allspice panic as i backtraced one process, for no apparent reason, so i had to reboot rather than continue. 1128. Date: Mon, 16 Apr 90 11:10:49 PDT From: Fred Douglis <douglis> Subject: ipServer dying on ds3100 my ipServer has died twice in the past 3 days. i wasn't able to determine its state the first time (i wasn't here when it happened), but this time i saw a message about a reserved instruction and dbx showed an empty stack with the PC in hyperspace. apologies for the "ipServer on kvetching died and was restarted" message. it's from the fixIPServer script. i'll change the script to do it only if given an argument, which crontab for the servers can do. 1129. Date: 16 Apr 90 09:46:34 PDT (Monday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: Latest allspice crash A file goes through two states on its way to disk. FSCACHE_FILE_ON_DIRTY_LIST is the first state, and FSCACHE_FILE_BEING_WRITTEN is the second. The scavenge procedure found a file in the second state. 
I think this is ok, actually, and the scavenger should just skip over the file. I'd change Fscache_OkToScavenge so that
    ok = (numBlocks == 0) && ((cacheInfoPtr->flags & FSCACHE_FILE_ON_DIRTY_LIST) == 0);
is updated to be:
    ok = (numBlocks == 0) && ((cacheInfoPtr->flags & FSCACHE_FILE_ON_DIRTY_LIST) == 0) && ((cacheInfoPtr->flags & FSCACHE_FILE_BEING_WRITTEN) == 0);
And then nuke the panic that follows. 1130. Date: 16 Apr 90 09:49:54 PDT (Monday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: Allspice crash. Ah ha. So Allspice dies anyway, even if you skip over a file that is FSCACHE_FILE_BEING_WRITTEN. Someone needs to figure out how files go from BEING_WRITTEN to be back on the FILE_ON_DIRTY_LIST and fix the case where the FSCACHE_FILE_GONE flag is set (i.e., the file is deleted). 1131. Date: Mon, 16 Apr 90 15:24:19 PDT From: pmchen (Peter M. Chen) Subject: /user3 I can't ls /user3, except when I'm in /user3.
    mustard% cd
    mustard% pwd
    /user4/pmchen
    mustard% prefix
    Prefix              Server    Domain  File #  Version
    /                   mint      0       2       1   imported
    /sprite             mint      1       2       1   imported
    /swap1              allspice  1       2       1   imported
    /user3              king      3       2       1   imported
    /user4              assault   9       2       1   imported
    /mic                allspice  3       2       1   imported
    /user1              allspice  2       2       1   imported
    /tmp                oregano   3       55584   1   imported
    /scratch            oregano   4       2       1   imported
    /user2              assault   0       2       1   imported
    /c                  oregano   3       2       1   imported
    /sprite2            oregano   782     2       0   imported
    /sprite/src         allspice  7       2       1   imported
    /dist               assault   2       2       1   imported
    /sprite/src/kernel  allspice  6       2       1   imported
    /X11                allspice  9       2       1   imported
    /spur2              oregano   780     2       0   imported
    mustard% ls /user3
    /user3 not found
    mustard% cd /user3
    mustard% pwd
    /user3
    mustard% ls
    jhh/ lost+found/ lutz/ ss/
    mustard% cd
    mustard% pwd
    /user4/pmchen
    mustard% ls /user3
    /user3 not found
1132. Date: 16 Apr 90 17:49:22 PDT (Monday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Sprite at PARC Here is what I've hit so far when trying to bring Sprite up. I'm not there yet, so this is a running list.
Regarding the README file: I've already split this into a README.sunos and a README.ultrix. There is duplication between the two, but I figured that it's still better to isolate system dependencies. It looks like the single README file has just been edited (and not completely) depending on what's being done. There is still a reference to "dev_file" that should be "devFile". There is no mention of what to do if you don't have a name server. I'm putting /dev/null into /etc/resolve.conf for now. About fsinstall: It stops dead if it finds a file it can't read. This happens in an NFS environment (mine, anyway) if you are running as root on the workstation (because you need to write the disk) and you hit an NFS file that is read-only to the owner. In NFS-land a remote root process can't even read that kind of file, and fsinstall stops. So far this has found sprite/lib/xrn/rn.bindings sprite/daemons.sun3/arp.config.notused The kicker (or killer) is that fsinstall dumped core near the end, on or about sprite/boot/sun3.md/sprite (the kernel image) Source code would be good. I tried booting anyway because it looks like I've got most of the commands, plus boot/bootcmds. However, the kernel I have doesn't look for a root on /dev/rxy0c, which applies to me. I'll have to gen up a kernel at Berkeley that looks for this disk. Finally, if a Sprite host can't find a file system you have to power-cycle it. As my machine is downstairs that's sort of a pain. The ttyDriver needs to enable the L1 key processing much sooner. That's all for now. Is there a better fsinstall I can try? 1133. Date: Tue, 17 Apr 90 10:29:47 PDT From: Fred Douglis <douglis> Subject: pmake bug: trailing blank included in variable in release 2.1 (beta?): one of our users couldn't understand why pmake was trying to link *.po and *.o into a single executable, in one particular directory.
It turned out he had PROFSUFFIX = .pg NAME = foo ^--- note trailing space profile : %(TM).md/%(NAME)%(PROFSUFFIX) %(TM).md/%(NAME)%(PROFSUFFIX) : %(OBJS:S/.o%/.po/g) %(LIBS:S/.a%/_p.a/g) and this came out as ds3100.md/foo .pg : ...... instead of ds3100.md/foo.pg : ...... ... So, is it intentional for blanks to be included at the end of variables? i guess you might want that, under some circumstances, but then a warning message would certainly be useful... 1134. Date: Tue, 17 Apr 90 17:29:53 PDT From: tve (Thorsten von Eicken) Subject: gdb/tx locks up I have a program which uses sockets and has a bug (obviously..). When I try to debug it, things look fine at first, but soon sprite starts playing tricks on me. In particular, the process I'm debugging and gdb become unkillable (i.e. a kill -KILL hangs), and the tx freezes as soon as I hit control-C or the like. Mendel suggested that something got locked in the kernel due to recovery or who knows what. I rebooted, but the problem reappears quickly. To reproduce: (on crackle, a sun4 running 1.063) cd ~casotto/projects/vov/develop/shell gdb vov_sh [takes a while] run -I [takes a while, the process starts, nothing happens, wait ~10 seconds] ctrl-C [gdb says hello again] run -I [respond yes to the question, gdb says "starting program", but nothing further happens, now hit ctrl-C and you're dead] This is the quickest way to get the problem, many other roads seem to lead to the same pit... 1135. Date: Wed, 18 Apr 90 08:46:32 PDT From: ouster (John Ousterhout) Subject: More corruptions The following files were found to be corrupted yesterday: /sprite/users/alc/tests/nocachetests/results.nc3 /sprite/users/hilfingr/mp/enbsigfifo.o The following files were found to be corrupted today: /sprite/users/david/sequent/patpg/sun3.md/md.mk 1136. Date: Wed, 18 Apr 90 10:16:32 PDT From: culler (David Culler) Subject: DBX What is the chance of getting a working debugger on the DS3100?
I am having lots of new (and some old) problems with dbx. Step and Next often do not work --- dbx complains about illegal instructions and syslog says Bogus bp-trap. Also, if you let the help run to the end, dbx hangs. Sometimes it hangs when you ctrl-c out of the help before the end. 1137. Date: Wed, 18 Apr 90 13:06:39 PDT From: tve (Thorsten von Eicken) Subject: more corruptions (on /mic) ------- Forwarded Message Return-Path: daemon Received: by sprite.Berkeley.EDU (5.59/1.29) id AA794713; Wed, 18 Apr 90 06:28:15 PDT Date: Wed, 18 Apr 90 06:28:15 PDT From: root (The Sprite God) Message-Id: <9004181328.AA794713@sprite.Berkeley.EDU> To: tve Subject: Checksum run for /mic Checksum started at Wed Apr 18 05:00:11 PDT 1990 Running on mint.Berkeley.EDU ./guest/casotto/projects/vov/develop/trace/vovlib.h corrupted: id 148410 mtime 260812b6 old 16d0e096 new d3f5becf 1 errors found Checksum completed at Wed Apr 18 06:28:09 PDT 1990 ------- End of Forwarded Message 1138. Date: Wed, 18 Apr 90 13:42:33 PDT From: mendel (Mendel Rosenblum) Subject: cache bug A couple of problems in the cache code: When the Vm_CopyIn in Fscache_Write fails (for example if the user passes a bogus pointer or buffer length to the write() system call), the code deletes the cache block being written. This can cause data loss if the cache block contains delayed write data not yet written to disk.
Note that the lost changes could belong to another process. The cacheInfo data structure is modified by the routines in fsCacheOps.c and those in fsBlockCache.c. Unfortunately, the different files contain different MONITOR_LOCKs. The fsCacheOps.c routines lock cacheInfoPtr->lock while those in fsBlockCache.c use cacheLock. On a multiprocessor, two processors could attempt read/modify/write instruction sequences at the same time on this data structure. 1139. Date: 18 Apr 90 15:38:12 PDT (Wednesday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: cache bug About cache locking. My goal was that the routines in fsCacheOps.c grab the per-file monitor lock and then call into the routines in fsBlockCache.c. The latter grabs a global cache lock. So, when a routine in fsBlockCache.c modifies a cacheInfo structure it should already be locked. This may or may not be true in all cases - background processing comes to mind. 1140. Date: Wed, 18 Apr 90 15:48:45 PDT From: tve (Thorsten von Eicken) Subject: big debugging problems on sun4 More of the same style as I reported yesterday (subject was: gdb/tx locks up). Does gdb run at all????? 1141. Date: Wed, 18 Apr 90 19:33:40 PDT From: mgbaker (Mary Gray Baker) Subject: funny date I just recompiled sys/{sun4c,sun4,sun3}.md/sys.o and here are the dates of the respective machine type binaries:
    ls -lt *.md/sys.o
    -rwxrwxr-x 1 mgbaker 159603 Apr 18 1990  sun3.md/sys.o*
    -rwxrwxr-x 1 mgbaker 144228 Apr 18 19:28 sun4.md/sys.o*
    -rwxrwxr-x 1 mgbaker 142548 Apr 18 19:27 sun4c.md/sys.o*
    -rw-rw-r-- 1 mgbaker 115728 Apr 18 17:35 ds3100.md/sys.o
    -rw-rw-r-- 1 jhh     154146 Mar 30 12:44 spur.md/sys.o
    -rw-rw-r-- 1 douglis 109188 Oct 31 13:57 cleands3100.md/sys.o
The sun3.md/sys.o looks funny to me. 1142. Date: Thu, 19 Apr 90 09:14:42 PDT From: ouster (John Ousterhout) Subject: Allspice crash When I came in this morning Allspice was dead, with our old familiar friend, "Fscache_DeleteFile failed...".
I have a very difficult time rebooting it, because the kernel "sun4.md/new" (1.064) didn't boot: it hung just after printing the line "IE-0 net interface at ...". After trying various things like power-cycling the machine, I eventually gave up and tried the default kernel (1.063), which worked. Did someone install an untested .new kernel recently? 1143. Date: Thu, 19 Apr 90 13:03:06 PDT From: mgbaker (Mary Gray Baker) Subject: Re: Allspice crash I installed a new new kernel yesterday. Things seemed to work on all the machine types when I linked it from my home directory. I then installed it and then tested it again on 2 out of the 4 machine types. In the meantime, something bad must have happened to the rpc module, because to fix the problem, I just recompiled the rpc module. Nothing had been edited but the binary was a different size. I saved the binary to see if it got corrupted somehow. I just reinstalled the new new kernel which boots fine on anise. I'm sorry about the trouble, but I did try testing the stuff! 1144. Date: Thu, 19 Apr 90 13:05:09 PDT From: shirriff (Ken Shirriff) Subject: Re: Allspice crash Allspice crashed around 12:15. It was running the 1.063 kernel. The crash was: MachHandleWeirdoInstruction: unaligned address trap in the kernel Fatal Error: MachHandleWeirdoInstruction: the error occured in a kernel proc with procPtr=f66c4de0 and pc=f604770c Rosemary was down for disk work, so I couldn't debug it. 1145. Date: Thu, 19 Apr 90 13:58:40 PDT From: mgbaker (Mary Gray Baker) Subject: Re: Allspice crash I just ran the debugger on 1.063 to see where allspice crashed. It crashed in FslclLookup at line 342 where it indirects through curHandlePtr and curHandlePtr->descPtr: /* * At this point we have a locked handle on the current point * in the lookup, and perhaps have a locked handle on the parent. * Links are expanded now so we know whether or not the * lookup is completed. 
On the last component, we only * expand the link if the FS_FOLLOW flag is present. */ if ((status == SUCCESS) && ((*curCharPtr != '\0') || (useFlags & FS_FOLLOW)) && ((curHandlePtr->descPtr->fileType == FS_SYMBOLIC_LINK || curHandlePtr->descPtr->fileType == FS_REMOTE_LINK))) { Have we ever had a problem with one of these being NIL or otherwise bogus before? 1146. Date: Thu, 19 Apr 90 14:32:25 PDT From: schauser (Klaus Erik Schauser) Subject: bus error I very often get the following severe error on paprika (sun 3/75). Entering debugger with a bus error exception at pc 0xe06d1ac It seems to happen when I work with emacs under xwindows. It happens about once a day, afterwards everything is dead, so I need to reboot. When rebooting, the boot process sometimes stops after the line using IP address .... But trying to boot for the second time usually helps. Please keep me informed. 1147. Date: Thu, 19 Apr 90 15:25:02 PDT From: elm (ethan miller) Subject: highlight problem in twm, X11R4 When I use twm under X11R4, the windows that are supposed to be highlighted by a border change when I move the mouse aren't highlighted. The X11R3 version of twm changed the border to (in my case) red whenever input was being sent to that window. This is not a serious bug, but it would be nice if someone looked at it if they had the chance (or told me where to look). All of this occurs on my SparcStation (terrorism). 1148. Date: Thu, 19 Apr 90 16:03:29 PDT From: tve (Thorsten von Eicken) Subject: Re: highlight problem in twm, X11R4 Yes, I get that too. I seem to remember that it's a bug for which a fix is floating around. Due to disk space limitations, it's hard to apply the fixes now. Please hang on.... 1149. Date: Thu, 19 Apr 90 16:52:57 PDT From: tve (Thorsten von Eicken) Subject: last allspice crash It seems allspice crashed between yesterday evening and this afternoon. None of the machines in 444 recovered in any way: they were all completely dead. Dunno whether that means anything...
1150. Date: Thu, 19 Apr 90 18:19:22 PDT From: rab (Robert A. Bruce) Subject: Allspice crash Allspice crashed. It panicked in Fscache_DeleteFile, line 1372 /* * At this point the file should have no cache blocks associated * with it, clean or dirty, and the file itself should not be * on the dirty list or being written out. */ if ((cacheInfoPtr->blocksInCache > 0) || (cacheInfoPtr->flags & (FSCACHE_FILE_ON_DIRTY_LIST| FSCACHE_FILE_BEING_WRITTEN))) { panic("Fscache_DeleteFile failed \"%s\" blocks %d flags %x\n", Fsutil_HandleName(cacheInfoPtr->hdrPtr), cacheInfoPtr->blocksInCache, cacheInfoPtr->flags); } cacheInfoPtr->blocksInCache was zero, but cacheInfoPtr->flags was 0x0880, which is (FSCACHE_FILE_GONE | FSCACHE_FILE_ON_DIRTY_LIST). 1151. Date: Thu, 19 Apr 90 19:42:37 PDT From: brent (Brent Welch) Subject: distribution bugs Sprite is sort of up at PARC. Along the way I've noted the following problems with the distribution and/or Sprite itself. 1) (We know about this...) You can't really use a 'c' partition for a file system unless the 'a' partition is of equal size. No wait, actually if the table in devConfig.c is set up to look only for the 'c' partition (unit 2) then the attach of the 'c' partition will work. However, if the table looks for 'a', then the switch-over to partition 'c' (based on header info) is too late. The driver already thinks the 'a' partition corresponds to the file system and you can't access most of the 'c' partition. I had to resort to either patching my kernel's table or changing the disk label so 'a' == 'c'. 2) There is no xy0c in the devFile, so this device isn't created. Of course this is the device that I named in my mount table... I fixed my devFile (the file with the list of devices to create) to include all the xy0 partitions (a through h). I can't think of a good reason to leave any out. 3) loadavg is run with the wrong arguments in bootcmds. It should be replaced altogether with a call to /sprite/daemons/migd.
4) I have to boot "ie() -a" to net-boot my machine. initsprite doesn't grok the -a argument and is noisy about it. 5) Finally, and perhaps most seriously, I think the formatting done by fsinstall/fsmake for the non-scsi disks isn't right. It doesn't handle drives with reserved sectors on each track. My disk has 67 sectors/track, but it is formatted so that only 64 sectors appear and the rest are reserved for 'slip-sector' handling of bad spots on the disk. fsmake uses the raw values of the sectors/track that it gets from the disk label. I patched around this by using dbx to reset the value to 64 before it laid out the file system. Unfortunately the only way to determine the number of formatted sectors is to do a low-level read of a track and count up logical sectors. 6) fsinstall doesn't create a lost+found directory if one isn't in the directory structure to copy. It should. 7) It appears that fsinstall doesn't write out the root directory until it's all done. At least if it bombs out part-way through the root directory only has "." and ".." in it. I had to figure that out by running the kernel debugger, which seems to work just fine over here. 1152. Date: 19 Apr 90 16:48:36 PDT (Thursday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: Allspice crash Perhaps the descPtr was NIL. This happens when a handle gets removed too soon. 1153. Date: Fri, 20 Apr 90 12:45:25 PDT From: tve (Thorsten von Eicken) Subject: nfsmount in DEBUG root c263c 0.0 21.2 5584 3464 DEBUG 2:57 nfsmount eros:/octtools /eros/octtools As you can see, it got quite large (5Megs), and that in about 20 minutes of use! Is anyone looking at it? Spring cleaning? Summer cleaning? 1154. Date: Fri, 20 Apr 90 14:36:17 PDT From: tve (Thorsten von Eicken) Subject: entertaining recovery with oregano At first crackle didn't want to recover, I had to "play around a bit" typing commands etc...
Then the syslog went wild scrolling the following: [crackle tve] Fsprefix_HandleClose nuking "/c" Broadcasting for server of "/c" 4/20/90 14:33:09 oregano (38) - recovering handles Fsprefix_HandleClose nuking "/tmp" Importing "/c" from oregano .... Fsprefix_OpenCheck waiting for recovery Fsprefix_OpenCheck ok Fsprefix_OpenCheck ok Fsprefix_OpenCheck ok LE ethernet: Received packet with overflow error. 1155. Date: Fri, 20 Apr 90 14:59:10 PDT From: Fred Douglis <douglis> Subject: Re: entertaining recovery with oregano i should point out that the repeating recovery is due to a timeout on a pair of prefix RPCs that are going to oregano. why it's doing a prefix rpc directly to oregano, and why that rpc is timing out, is beyond me. i did notice that almost all of oregano's daemons, esp. nfsmount and inetd, disappeared earlier. i had a prefix entry for /envy/usr on oregano, but even deleting that prefix didn't cause the timeouts to stop. 1156. Date: Fri, 20 Apr 90 16:26:40 PDT From: mgbaker (Mary Gray Baker) Subject: Re: entertaining recovery with oregano I had the same experience with a prefix on assault yesterday. I was unable to touch /rosemary/tmp without my machine going crazy. I'll try to figure out what's going on. 1157. Date: Fri, 20 Apr 90 19:09:03 PDT From: mgbaker (Mary Gray Baker) Subject: Horrible time with mint I got back from aerobics and mint had crashed, apparently while it had been rebooting from another crash. It said a bunch of nasty things on its console. There were about 5 or 6 rpc version mismatch messages with bad client and srvr fields: Version mismatch clt 175 srv 329 file "noname" from client 25 Then it said MEMORY ERROR! Status DF, DVMA-BIT 0, Context 0, Vaddr: E1E5330, Paddr: 001F3330, Type 0 at 0E020CA2 Break FFFF at 0E020CA0 > Warning: Intel: Bus error on chip And then it had another memory error with the number E020E4A instead, followed by a break and a prom prompt and an Intel bus error. 
I hit return on the console and it printed out 3 more version mismatches! Then it got more memory errors, etc. Then it managed to sync its disks and put itself in the debugger due to the current process being NIL. Everything was such a mess that I didn't think it could be worthwhile to debug it and I decided to reset it to see if that helped. I did a k1 but a k2 froze. I had to power cycle it, twice. I finally got it to reboot, with a few version mismatches and an all-time winner of a recovery storm. Unfortunately from the point of view of debugging our recent recovery problems, all the machines with recovery tracing appeared to recover just fine. At least the air in the machine room hadn't been reached by the evening stink bomb, so I was able to breathe well for a while. 1158. Date: Mon, 23 Apr 90 08:23:43 PDT From: ouster (John Ousterhout) Subject: rsh to allspice broken Rsh doesn't seem to work to Allspice. When I try it from Tyranny (under my account) or from Mint (under the root account) I get "allspice.Berkeley.EDU: connection refused". I know this used to work, because the checksum program uses it; it stopped working around the middle of last week. 1159. Date: Fri, 20 Apr 90 21:02:13 EDT From: douglis@piquante.Berkeley.EDU (Fred Douglis) Subject: mint was ailing. so is ginger. 1) mint died with a negative reference count. it looks like Fsconsist_Kill set the reference count on a client structure to all 0's but then Fsconsist_IOClientClose decremented it to -1. This was after mint got a stale handle status back from host 56 after it rebooted. 2) mint was continued okay from this state, but wedged up during recovery. it printed various messages about files "alc" and "shirriff" with different clients having dirty blocks -- check those mail files! it wouldn't respond to break-d or break -anything else, complaining "non-zero character on serialB" or something to that effect. 
3) ginger's console is totally unusable because it is hung on an NFS mount to mint, of all places. anyone know who might have nfs-mounted mint to ginger (a hard mount to boot!)? bks said this happened a week ago too. logins to ginger wedge up as well -- i could run kmsg only by logging in remotely as root. 1160. Date: Mon, 23 Apr 90 10:35:49 PDT From: Fred Douglis <douglis> Subject: allspice daemons its inetd was running but was not working properly, and its sendmail had disappeared completely. this is similar to what has happened recently with oregano, the difference being that it's necessary to log in to allspice directly to kill its ipServer (whereas i could migrate onto oregano enough to get the processID and kill it remotely). 1161. Date: Mon, 23 Apr 90 17:36:30 PDT From: sequent!fubar@uunet.uu.net Subject: /sprite/cmds/tape write eof buglet /sprite/src/cmds/tape/tape.c as distributed has: [...the beginning of the file...] int skipFiles = 0; int skipBlocks = 0; int writeIt = 0; int blockSize = 16 * 1024; int weof = 0; Option optionArray[] = { { OPT_STRING, "t", (Address)&tapeFile, "Name of tape device" }, { OPT_TRUE, "r", (Address)&rewindIt, "Rewind the tape" }, { OPT_TRUE, "T", (Address)&retension, "Retension the tape" }, { OPT_TRUE, "e", (Address)&gotoEnd, "Skip to the end of the tape" }, { OPT_INT, "f", (Address)&skipFiles, "Number of tape files to skip" }, { OPT_INT, "b", (Address)&skipBlocks, "Number of blocks to skip" }, { OPT_INT, "m", (Address)&skipBlocks, "Number of end-of-file marks to write" }, { OPT_INT, "B", (Address)&blockSize, "Block size" }, [...the rest of the file...] Note the arg to the "m" option. This should probably be: { OPT_INT, "m", (Address)&weof, "Number of end-of-file marks to write" }, 1162. 
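The effect of the mis-bound "m" entry reported in message 1161 can be sketched with a small stand-in option table. This is only an illustration under stated assumptions, not the Sprite option library: the OptEntry type and the ParseIntOption and RunTapeOptionDemo helpers are hypothetical, and only the variable names skipBlocks and weof come from the tape.c excerpt.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one integer entry in an option table. */
typedef struct {
    const char *key;    /* option letter, without the leading '-' */
    int        *target; /* variable that receives the parsed value */
} OptEntry;

/*
 * Look up key in table and store the integer value in the variable
 * bound to it. Returns 1 if the key was found, 0 otherwise.
 */
static int
ParseIntOption(const OptEntry *table, int numEntries, const char *key,
               const char *value)
{
    int i;

    for (i = 0; i < numEntries; i++) {
        if (strcmp(table[i].key, key) == 0) {
            *table[i].target = atoi(value);
            return 1;
        }
    }
    return 0;
}

/*
 * Parse "-m <value>" with either the binding as distributed ("m" bound
 * to skipBlocks) or the proposed one ("m" bound to weof), and report
 * what both variables hold afterwards.
 */
static void
RunTapeOptionDemo(int useFixedTable, const char *value,
                  int *skipBlocksOut, int *weofOut)
{
    int skipBlocks = 0;
    int weof = 0;
    OptEntry table[2];

    table[0].key = "b";
    table[0].target = &skipBlocks;
    table[1].key = "m";
    table[1].target = useFixedTable ? &weof : &skipBlocks;

    (void) ParseIntOption(table, 2, "m", value);
    *skipBlocksOut = skipBlocks;
    *weofOut = weof;
}
```

With the binding as distributed, "-m 3" leaves weof at 0 and sets skipBlocks to 3, so no end-of-file marks would be requested; with the proposed binding, weof receives the value and skipBlocks is untouched.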
Date: Mon, 23 Apr 90 17:47:15 PDT From: elm (ethan miller) Subject: funny recovery state I get the following message sequence, in an infinite loop, in the syslog: 4/23/90 17:37:46 allspice (14) - recovering handles 4/23/90 17:37:46 allspice (14) Recovery complete 104 handles reopened 4/23/90 17:37:46 allspice (14) Fs_PageCopy, waiting for server %d 4/23/90 17:37:46 allspice (14) RmtFile "43" <1,38574> : stale handle <prefix> 4/23/90 17:37:52 broadcast (0) RPC timed-out This is on terrorism, and it occurred after allspice's crash and recovery on Monday afternoon. At the time of the crash, the shell in question (tcsh) was trying to execute an ls -l. I don't know which directory was being listed, but I know it was a native Sprite (not NFS) directory. I am rebooting terrorism. 1163. Date: 23 Apr 90 18:10:03 PDT (Monday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: funny recovery state Sounds like Fs_PageCopy needs to guard against an invalid handle (Fsutil_HandleInvalid ?) caused by failed recovery. 1164. Date: Tue, 24 Apr 90 13:12:36 PDT From: shirriff (Ken Shirriff) Subject: Allspice crashed Allspice crashed with: MachHandleWeirdoInstruction: unaligned address trap procPtr = f681c4b8, pc=f605d390 Entering debugger TI TI TI TI TI [repeat from MachHandleWeirdoInstruction...] After a bunch of this it did a watchdog reset, so I rebooted it. 1165. Date: Tue, 24 Apr 90 15:57:48 PDT From: eklee (Edward K. Lee) Subject: trashed file The file ~eklee/raidSim3/test.L/test.8/tmult.out has been trashed sometime in the last 5 or ten minutes (that's when I created the file). I've moved it to ~eklee/trashed/tmult.out in case someone wants to look at it. The file now consists of fragments of other peoples mail messages. Luckily, it's a machine generated file that is easy to replace. 1166. Date: Tue, 24 Apr 90 16:08:23 PDT From: Fred Douglis <douglis> Subject: Re: trashed file both of the mail fragments in ed's trashed files are copies of *outgoing* mail from my account. 
i use MH to send mail, so drafts go in /user2/douglis/Mail/drafts/...., while ed's file was also on assault in /user4. both of my messages were created in the few minutes preceeding ed's message. in fact, i can date the second message as 15:49 and the first one only a few minutes before that. assault did not crash during that time, but i did recovery w/ oregano a few times in there. for whatever that's worth. the first draft was contained in ed's file in its entirety, from offset 0 to 0134. the second was partial (maybe from an autosave file?), from offset 010000 to 010330. there were nulls in between and following the second draft. 1167. Date: Tue, 24 Apr 90 17:14:20 PDT From: tve (Thorsten von Eicken) Subject: rcsmerge doesn't work In particular, it calls merge, which calls diff3 giving it 5 filenames. Diff3 turns around and says it accepts only 3 filenames. 1168. From: mtxinu!uunet.uu.net!myrias!alberta!anthony@ucbvax.Berkeley.EDU Date: Tue, 24 Apr 90 14:59:42 -0600 Subject: Status This message is a report on our sprite status, (things are not too good). 1) XSprite still does not work. After much looking around and ftping sources I managed to compile a debuggable version. It turns out that the server was allocating pixmap memory at virtual address 0x80000 which when the program tried to write this address when painting the gray screen background would die with a segmentation violation. This happened in Xsprite(Xmfb)/mfbpntwin.c in function mfbPaintWindow (I do not have the line number with me at the moment but will get it to you if it is necessary at a later date). 2) I left the Xsprite problem for a while to work on trying to get our new hard disk working. All attempts at using mkscsidev/update, fsmake/fsinstall, fsmake/tar, fsmake/dump/restore failed. 
So I looked into the source for fsinstall, made a few modifications (two patches and one bug fix), and by installing a disktab entry for our new disk in /etc/disktab we managed to get it to install all the files from our small disk onto our new disk, with the exception of pdevs and regular devs. The pdevs and dev. had to be recreated (some by hand). One of the problems I faced in patching fsinstall was making the binary. I kept getting errors when I compiled it locally, so I changed the source locally, and built it at "murder". Well after everything got set up we booted off the new disk with a few problems, I do not recall which, but managed to fix them, one of the problems was the all -heapSize 1000000 command in the mount file; it was too small, so we removed the -heap limitation (Booting now is quite slow, since the fscheck reads the whole disk). (The new disk is 500Mbytes, while the old one was 120Mbytes). Things looked good for a while, until I started getting BlockIOProc: messages on the console. (From kernel/dev/devSCSIDisk.c:47). The complaint was about some sector number being larger than another. Well at the time it did not seem to affect much. The error came whenever the cpu/disk were fairly busy, like in a compilation of a large piece of software (I was remaking the Xsprite software at the time). I got a few other errors from the file system. One of the other errors was "FsioVerifyBlockWrite: disk block mismatch..." This one was terminal. It seems that a swap file was corrupted. To fix this one I removed the directory (/swap/1/132) and rebooted the machine. I also got a Fsdm error, but I did not make any record of it. One of the things I noticed with the BlockIOProc errors was it heralded the magic corruption of a user's file, after which that file would always be overwritten with trash by some wonderful invisible imp.
Now I had plans to look at the BlockIOProc errors to try and determine their cause, but I was under pressure to bring X up on my machine, so I ignored the problem (Bad error) and proceeded to ftp sources from the /X11R3 directories. (At this point I would like to put in a plug for a /X11R3 tape). Well today in attempting to compile Xsp I ran into a large number of BlockIOProc errors when the make reached the "cfb" subdir, together with a number of fsdm type errors (do not recall, said something about attach and File ID <1, 1, 175>). Well to cut a long story short, somehow, somewhere I managed to trash the / dir so bad that the fs sys was unable to write anything. Reads at the time would work, for the current dir etc. I tried to sync the disk, but to no avail, so in frustration and haste (another bad move) I aborted the machine and tried to reboot. Well, now the / prefix will not come on line since it is corrupted. The boot phase gets to Executing diskcmds. and comes up with the error "Corrupted directory? File ID <1, 1, 175> dirBlockNum<0>, blockOffset <0>" The boot programs then close the "/" and broadcast for the / server, which it will not find since I am the only sprite site still. (Figure now in retrospect I should have tried to do a fscheck with fixRoot, before aborting). Now the name of the game is to try and recover from this drastic error without setting me back weeks. Since our sprite site has never been fully operational we have not had backup services being done, so I could lose a whole hog of work. Now these disk errors were not present in our small disk sprite operations, so they have to be the result of the new disk, but I have no idea where, why, or how. Insight from knowledgeable spriters would be useful. Well for now I am going to try to see if I boot off the old disk, maybe I can fix the root of the new disk with fscheck.
(maybe, :-)) As for the fsinstall, I think if you look at the file ~anthony/src/fsinstall/fsmake.c you should be able to get a diff with the regular fsmake.c to highlight the diffs. Just as a reminder we have a sun 3/60 (output from the Sys_getArch... call is Arch=3 Type=0x17) 500Mbyte hard disk (a temporary 120Mbyte hard disk with our original sprite setup). The machine's name is swalwell.cs.ualberta.ca 129.128.4.26 and if it is up is available on Internet. I have been led to believe that the host entry tables at UBC (British Columbia) have been updated to point to our name server 129.128.4.241. And an account with id rab passwd ? (if you have not changed it see prior mail I cannot remember what it was). My work on the Virtual Machine has been stalled by the difficulties I have been having, but I really have hopes of getting it done if I can get past these troubles. I know this report is not altogether complete, but I thought I should send something while I wait for the sys. people to come fix up my disk setup. 1169. Date: 25 Apr 90 13:11:45 PDT (Wednesday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: allspice crash I know this is a patch, but it looks like it would be safe to simply remove the cacheInfoPtr from the dirty-file-list where the current panic kicks in. The file is marked FSCACHE_FILE_GONE, so no I/O operations will do anything, and all the block lists for the file are empty, so no blocks would be abandoned. The fix, of course, is to determine who or how the file is being put back onto the dirty list. Mendel, do you know of a good reason why this won't work? 1170. Date: Wed, 25 Apr 90 14:38:42 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: babylon crash Babylon had a process that was locked, so that anything that touched it would freeze up, like "ps". The process was in the middle of being migrated (Proc_MigrateTrap) and was trying to free a page.
The page had the following flags set: VM_DONT_FREE_UNTIL_CLEAN, VM_SEG_PAGEOUT_WAIT, VM_DIRTY_PAGE, and VM_FREE_PAGE. The process was waiting because of the VM_DONT_FREE_UNTIL_CLEAN flag. I could not find an entry on the timer queue corresponding to a call to PageOut, nor could I find a Proc_ServerProc that was already handling it. Evidently it was lost. The kernel was ds3100.KS.079, which is composed entirely of uninstalled modules. Here is a backtrace of the process:

> 0 Mach_ContextSwitch(0xc01efd24, 0xfffff, 0x800b7a70, 0x8013a218, 0x800b8920) ["ds3100.md/machAsm.s":929, 0x80032ff8]
  1 Sched_ContextSwitchInt(state = PROC_WAITING) ["schedule.c":434, 0x800b4668]
  2 SyncEventWaitInt(event = 2149077728, wakeIfSignal = 0) ["syncLock.c":673, 0x800b891c]
  3 Sync_SlowWait(conditionPtr = 0x801852e0, lockPtr = 0x8013a218, wakeIfSignal = 0) ["syncLock.c":283, 0x800b7a7c]
  4 VmPageFreeInt(pfNum = 2148770328) ["vmPage.c":1280, 0x800c5750]
  5 VmPageFree(pfNum = 3916) ["vmPage.c":1316, 0x800c57d0]
  6 FreePages(segPtr = 0x800eb33c) ["vmMigrate.c":729, 0x800c4294]
  7 Vm_EncapState(procPtr = 0xc01ee2e4, hostID = 60, infoPtr = 0xc12efd00, bufferPtr = 0xc021cbf4 = "") ["vmMigrate.c":168, 0x800c3748]
  8 Proc_MigrateTrap(procPtr = 0xc01ee2e4) ["procMigrate.c":590, 0x8009f078]
  9 Sig_Handle(procPtr = (nil), sigStackPtr = 0xc12efe2c, pcPtr = 0xc12efe28) ["signals.c":1193, 0x800b72f0]
  10 .block15 ["ds3100.md/machCode.c":1275, 0x80034dc4]
  11 MachUserReturn(procPtr = 0xc01ee2e4) ["ds3100.md/machCode.c":1275, 0x80034dc4]
  12 MachSysCall(0x0, 0x1, 0x7ddffc80, 0x7ddffc7c, 0x7ddffc78) ["ds3100.md/machAsm.s":1531, 0x800335e4]

Here are the contents of the corePtr corresponding to the page:

(kdbx) p coreMap[3916]
struct {
    links = struct {
        prevPtr = 0x801b9c88
        nextPtr = 0x801ae810
    }
    virtPage = struct {
        segPtr = 0x801852d0
        page = 65536
        offset = 636
        flags = 1
        sharedPtr = (nil)
    }
    wireCount = 0
    lockCount = 0
    flags = 23
    lastRef = 641065999
}
(kdbx) p (char *) 23
0x17

1171.
Date: 25 Apr 90 16:02:00 PDT (Wednesday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: FS troubles
There are two different things. The "double-insert avoided" message concerns a race condition that has been fixed. You should be able to nuke that helpful message. The other situation is still unexplained. Indeed, we do not want deleted files on the dirty list. I'm pretty sure that you can simply remove the file from the dirty list at the point of the panic, because there are no cache blocks associated with it. Since you have a repeatable test case, I think you should try that. Of course, we still want to know why/who is putting the FSCACHE_FILE_GONE file back onto the dirty list, and fix the bug there instead of patching it later.

1172. Date: Wed, 25 Apr 90 18:20:58 PDT From: elm (ethan miller) Subject: killdebug problems
On my SparcStation, killdebug fails to recognize any process in the debugger if its process ID is only 4 digits long. It doesn't print the pid of the process, and the process isn't killed. This is obviously not a serious bug, but someone should know it exists.

1173. Date: Wed, 25 Apr 90 19:06:14 PDT From: rab (Robert A. Bruce) Subject: evil file
The file /user1/262/aho.bad/.cshrc is a bad file. If you attempt to access it in any way, your shell will lock up. I couldn't ls it or stat it or even mv it.

1174. Date: Wed, 25 Apr 90 19:48:29 PDT From: rab (Robert A. Bruce) Subject: /newroot
When I run df it says there is 214 meg in use on /newroot, but when I run ls it says there is nothing in the directory. Even . and .. are missing.

1175. Date: Wed, 25 Apr 90 21:44:13 PDT From: shirriff (Ken Shirriff) Subject: IOC_REPOSITION fails on nfsmount of miro
Nfsmount doesn't seem to handle IOC_REPOSITION, so fseek fails on anything on /miro. As a consequence, "more" or "vi" of anything on /miro loses the first few bytes of the file.
The lost data is what the program consumes when it reads in an exec header to check whether the file is executable and then fseeks back to the beginning to read in the file. Since the fseek doesn't work, 76 bytes are lost on a ds3100 or 32 on a sun.

1176. Date: Thu, 26 Apr 90 08:38:18 PDT From: ouster (John Ousterhout) Subject: More corruption
/sprite/guests/stolcke/miginfo/loadavg_c/sun3.md/md.mk ended up with a piece of a mail message from Patterson, sent late yesterday afternoon.
mint-6# fsindex -dev rxy0 -part g md.mk
md.mk Desc 26908 size 860 kbytes 1 version 0: 222429 -1 * 1 frag(s) offset 1 md.mk 1 blocks 1 seeks
/sprite/users/eklee/old2/old/raid.sim/sun3.md/cont.o ended up with a piece of a mail message from Brent, sent late yesterday afternoon.
mint-9# fsindex -dev rxy0 -part g cont.o
cont.o Desc 6364 size 9704 kbytes 10 version 0: 287504 -1 * 1: 287512 8 2: 175312 -112200 * 2 frag(s) offset 0 cont.o 3 blocks 2 seeks

1177. Date: Thu, 26 Apr 90 08:47:54 PDT From: ouster (John Ousterhout) Subject: More on corruption
I did some more analysis on the two corrupted files from yesterday, and found that in both cases the corrupted blocks were on the free list as well as in the corrupted files (I deleted the files and got "FsdmFragFree: block not free" messages; the message is a bit confusing, but it means the blocks were already free at the time of the free op). This implies that the problem is one of disk allocation and not a case of a buggy disk driver accidentally writing to the wrong place on disk.

1178. Date: Thu, 26 Apr 90 09:50:41 -0700 From: casotto@canova.Berkeley.EDU (Andrea Casotto) Subject: Cannot login
It looks like the .files in my home directory are unreadable or damaged. Try this on gluttony or crackle:
% cd ~casotto
% ls -al
.....
ls -al in my home directory hangs forever. (Plain ls works.) Everything was normal yesterday until about 3 pm.

1179. Date: Thu, 26 Apr 90 11:50:16 PDT From: jhh@sprite.Berkeley.EDU (John H.
Hartman) Subject: results from debugging allspice
Evidently allspice was waiting on a consistency callback to mustard for the file "~casotto/.cshrc". I guess mustard was hanging the rpc. Most of the rpc_servers on allspice were waiting for the consistency callback to finish. We put mustard into the debugger and rebooted allspice. That cleared up the wedged servers. I debugged mustard but was unable to figure out what was wrong. All of the servers were idle. Both machines were running 1.064. Aren't consistency callbacks supposed to time out after a minute or so if the client hangs the rpc? For whatever reason this didn't happen.

1180. Date: 26 Apr 90 12:43:40 PDT (Thursday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: results from debugging allspice
The RPC system still hangs forever if the other machine locks up. It is quite easy to change the RPC code (Mary would know how) so it aborts instead of hanging too long. You have to be a little careful, however, because some RPCs can take a long time for a good reason. Recall the behavior when Allspice is spending all its time writing out its dirty list. You could allow for special cases by adding a flag to Rpc_Call, perhaps, or by passing an upper bound on the time to wait. Prefix broadcasts are already special-cased, although in a cruder way.

1181. Date: Thu, 26 Apr 90 14:04:54 PDT From: mendel (Mendel Rosenblum) Subject: /swap1 disk busy
I was wondering why, with 80 Megabytes of file cache, the /swap1 disk was constantly being accessed. In the file write procedure Fsio_FileWrite() I saw the following comment after the cache write:

    if (writePtr->flags & FS_SWAP) {
        /*
         * While page-outs on the file server go to its cache, we
         * inform the cache that these pages are good candidates
         * for replacement.
         */
        Fscache_BlocksUnneeded(streamPtr, savedOffset, savedLength, FALSE);

This puts client swap pages on the front of the server's LRU list, which makes them the first blocks replaced when a block is needed.
This, combined with migration's use of the server cache, forces clients to page from the server's disk rather than the server's cache.

1182. Date: Thu, 26 Apr 90 17:53:47 PDT From: douglis (Fred Douglis) Subject: another post-recovery crash
kvetching wedged right after allspice came back, with a garbage pointer passed as a list header so a list_forall looped forever.

1183. Date: Fri, 27 Apr 90 11:38:26 PDT From: Fred Douglis <douglis> Subject: ds3100 bad libc
i got an error message during a compile when i was accidentally linking with the libc.a in the library source area:
Object file format error in: /sprite/src/lib/c/ds3100.md/libc.a(`): bad file magic number
"ar tv" on that file generated:
ar: Error: phase error on umask.o
so, i'm just going to rebuild the C library from scratch. a lot of it hasn't been recompiled in many months.

1184. Date: Sun, 29 Apr 90 17:02:35 PDT From: ouster (John Ousterhout) Subject: wall still hangs
Why is it that wall hangs every time I run it? I vaguely remember some reason about a bogus rlogin device, but can't the offending device simply be deleted?

1185. Date: Mon, 30 Apr 90 09:55:08 PDT From: Fred Douglis <douglis> Subject: Re: wall still hangs
yes, it's a problem with an rlogind process being associated with a /hosts/foo/rlogin* file but not actually responding to the pdev operations on it. unless pdev operations are made more careful, or rpcs can be made interruptible, wall will hang on those. it can't remove the file because it's hung before it knows the file is bad. i suppose i can change wall to fork a child, wait for the child to exit, and time out and remove the file if the child doesn't return in a minute or so. but this will still generate many permanently wedged processes.

1186. Date: 30 Apr 90 10:13:50 PDT (Monday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: wall still hangs
The pseudo-device client code can be more careful when opening a pseudo-device. The PDEV_SETUP state bit should be re-introduced.
Currently the bug arises when a client catches a pseudo-device that has not been initialized by the server, but only partially opened. The client waits (forever) for the server to start up. Instead, its open should fail. Fix Fspdev_PseudoStreamIoOpen, or whatever routine does the PDEV_OPEN request-response.

1187. Date: Mon, 30 Apr 90 14:23:08 PDT From: hohmeyer (Michael E. Hohmeyer) Subject: sprite crashes..
Greed has been crashing every other day or so with a "TLB LD miss" error. If anyone wants to investigate this, I can notify you the next time it happens.

1188. Date: Mon, 30 Apr 90 22:14:23 EDT From: douglis@piquante.Berkeley.EDU (Fred Douglis) Subject: damn sun4 binaries
still on oregano, of all places...

>From MAILER-DAEMON@sprite.Berkeley.EDU Mon Apr 30 19:11:40 1990
Received: from sprite.Berkeley.EDU (allspice.Berkeley.EDU) by ginger.Berkeley.EDU (4.1/1.41) id AA01089; Mon, 30 Apr 90 19:11:38 PDT
Received: from rosemary.Berkeley.EDU by sprite.Berkeley.EDU (5.59/1.29) id AA331332; Mon, 30 Apr 90 19:11:08 PDT
Date: Mon, 30 Apr 90 19:11:08 PDT
From: MAILER-DAEMON@sprite.Berkeley.EDU (Mail Delivery Subsystem)
Subject: Returned mail: Service unavailable
Message-Id: <9005010211.AA331332@sprite.Berkeley.EDU>
To: owner-sprite-log@sprite.Berkeley.EDU
Status: R

----- Transcript of session follows -----
451 Cannot exec /sprite/cmds/sh: no such file or directory
554 "|/users/sprite/cmds.gen/logger sprite log 'Sprite Log'"... Service unavailable
451 Cannot exec /sprite/cmds/sh: no such file or directory
554 "|/users/sprite/cmds.gen/logger sprite log 'Sprite Log'"...
Service unavailable
mail: mail: /tmp: cannot open for writing

----- Unsent message follows -----
Received: from rosemary.Berkeley.EDU by sprite.Berkeley.EDU (5.59/1.29) id AA331329; Mon, 30 Apr 90 19:11:08 PDT
Received: by rosemary.Berkeley.EDU (4.1/1.41) id AA00953; Mon, 30 Apr 90 19:11:22 PDT
Date: Mon, 30 Apr 90 19:11:22 PDT
From: douglis@rosemary.Berkeley.EDU (Fred Douglis)
Message-Id: <9005010211.AA00953@rosemary.Berkeley.EDU>
To: bugs@sprite.Berkeley.EDU
Subject: oregano out of memory

i tried to dump its memory stats but as soon as I called Mem_PrintStats it froze up completely and I could no longer get it in the debugger. Perhaps I needed to call some internal print routine, and it deadlocked on a monitor lock?

1189. Date: Mon, 30 Apr 90 23:36:42 PDT From: pmchen (Peter M. Chen) Subject: ps hangs on gluttony
I think there's a process on gluttony which is killing ps. I don't know which one it is, though. ps -au dies after printing out a number of processes. ps (as pmchen) finishes fine, as does ps as johnw, as casotto, and as bsmith.

1190. Date: Tue, 01 May 90 12:19:21 PDT From: rab (Robert A. Bruce) Subject: dumps
The dump failed yesterday with a write error.
Warning: Exabyte 8200 at SCSI3#0 Target 5 LUN 0 error: media error - info bytes 0x0 0x0 0x0 0x12
Warning: Exabyte maximum write retries attempted
Warning: Exabyte 8200 at SCSI3#0 Target 5 LUN 0 error: media error - info bytes 0x0 0x0 0x0 0x13
Exabyte File Mark Error

1191. Date: 1 May 90 11:08:07 PDT (Tuesday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: distribution bugs
Here are some more problems I've turned up in the Sprite distribution.

kernel bugs: As I already mentioned, I fixed the Xylogics driver regarding the byte-ordering of the low-level sector header information. This is in the uninstalled dev code. I also added Fsio_BootTimeTtyOpen to fsDevice.c. This goes with a fix in devTty.c regarding FS notify tokens.
Finally, this is called from mainInit.c, but this call is only in main.brent/sun3.md/mainInit.c. I think someone at Berkeley should shepherd these kernel changes in. I've only tested it on a Sun3, but the tty stuff is all machine independent anyway. This fix enables the L1 key before the "idling for 5 seconds" message.

nfsmount bug. Actually it is in the pdev.c library package. Any SET_ATTR call, and those GET_ATTR calls that returned an error status, would clear the selectBits for the open stream. This causes subsequent I/O on the stream to block. I've fixed the source (I found the bug today), but lost my network connection as I was checking in pdev.c.

rpcgen. This is the Sun RPC stub compiler. It no longer likes the mount.x and nfs_prot.x specification files. Has someone installed a new rpcgen? These .x files are missing from the distribution, too. I've finally gotten a fixed nfsmount, but the thing won't really recompile from scratch because of rpcgen. I had to copy intermediate files from Berkeley and touch things...

/sprite/main/lib.fmt/c is missing. This directory needs to exist in order to save formatted man pages of the C library.

The -lc_g library doesn't exist. It is a link into the source area and there is no target file.

Oh, perhaps the most crucial bug concerns the mount table. It is not documented, but it is crucial that the partition that the kernel mounts during bootstrap have group/pass "root". This is the magic thing that causes your machine to reboot after fixing errors in the bootstrapped partition. The example mount table neither contains this nor documents it. Also, the mount table documents "passes" instead of "groups". Fsattach now uses disk groups instead of passes. I don't think the man page for fsattach documents this either.

1192. Date: Tue, 01 May 90 12:31:35 PDT From: Fred Douglis <douglis> Subject: huge stack killed allspice
mendel looked into why allspice crashed before.
the reason the connection died was that your check-in of pdev.c ran amok. it had a stack of 241MB, and due to some bug regarding preemption, other processes were never getting to run while it was servicing page faults for the ci process. when allspice came back it died with the same cache writeback problem during client recovery, and it had to be rebooted again. it was down for a total of almost 2 hours.

1193. Date: Tue, 1 May 90 16:18:28 PDT From: mendel (Mendel Rosenblum) Subject: SCSI DMA error is back
Allspice got one of those SCSI DMA errors on disk rsd31c. It was reading a file descriptor block during the fscheck.

1194. Date: Tue, 1 May 90 17:22:38 PDT From: bsmith (Brian Smith) Subject: File lock bug
flock() is slightly broken, but the workaround is easy. When you try to unlock the file, you have to pass, in addition to the LOCK_UN flag, an or'ed-in LOCK_EX, LOCK_SH or LOCK_NB flag. Try it out with something like:

    #include <stdio.h>
    #include <sys/file.h>    /* flock(), LOCK_EX, LOCK_UN */

    int main(void)
    {
        FILE *f;

        f = fopen("foo", "r");    /* "foo" must already exist */
        while (1) {
            flock(fileno(f), LOCK_EX);    /* take an exclusive lock */
            flock(fileno(f), LOCK_UN);    /* the unlock that fails to release it */
            printf(".\n");
        }
    }

This never prints anything on the screen. If you change
    flock(fileno(f), LOCK_UN);
to
    flock(fileno(f), LOCK_UN|LOCK_EX);
though, it runs fine.

1195. Date: Tue, 1 May 90 22:55:16 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: bad cache block
This evening the /sprite/src/ directory appeared to be empty. There was something wrong with the cache block. I flushed allspice's cache and the problem went away. Brent reported this bug before. I'm not sure if this is a new bug, or if it is just a different symptom of another bug (perhaps the file-trashing bug?). Next time it occurs we should debug the machine, although it isn't easy to detect. The symptom is missing files, but there are lots of reasons why files may be missing.

1196.
Date: Wed, 2 May 90 09:09:16 PDT From: mendel (Mendel Rosenblum) Subject: rpn or tx bug
When you exit rpn in a tx window on a sun4, the error message "Couldn't find variable "ti"" is printed in the tx error message window, and the window is left in a funny state where it doesn't scroll correctly.

1197. Date: Thu, 3 May 90 09:18:09 PDT From: brent (Brent Welch) Subject: library sources missing from distribution
I have the following library sources:
c cmd curses dbm include l m monitorClient mxx net
I do not have sources for:
acu bootBin g++ memtrace pattern ps sunrpc sx tcl termlib test util
(This doesn't count X libraries, of course.) I need to fix something in sunrpc, for example, so I'll have to suck up the sources for that. I've also been using memtrace, but had to rlogin to Berkeley to remember what the routines were.

1198. Date: Thu, 3 May 90 10:03:53 PDT From: mendel (Mendel Rosenblum) Subject: allspice file cache problem
The directory /sprite/src/kernel got corrupted in allspice's cache. I did a "fscmd -f" and it was re-read from disk correctly. The cached copy of the attributes contained:

allspice% od -X /sprite/src/kernel
0000000 31302034 30203130 39302037 33302031
0000020 300a3320 34352031 38302036 34302032
0000040 36302036 36302034 20310a36 20343820
...
006340 31302031 35302030 20310a33 20352035
0006360 30203530 20313730 20313430 20302031
0006400 0a000000 00000000 00000000 00000000
0006420 00000000 00000000 00000000 00000000
*
0010000
allspice% strings /sprite/src/kernel
10 40 1090 730 10 3 45 180 640 260 660 4 1 ...
allspice%

1199. Date: Thu, 3 May 90 23:21:48 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: lost file
The version of netRoute.c that I've been editing for the last couple of weeks is now empty. This does not make me very happy. I guess the file was in piracy's cache when the machine crashed. I do have an Mx log that goes back to yesterday, and the file was dumped last evening. From this I hope to get most of the changes back.
I was hoping to apply the log to the RCS'd version of netRoute.c. This doesn't work, for a number of reasons. First of all, there are a bunch of cryptic numbers at the top of the log file which mx uses to determine whether the log belongs to this particular file. If it doesn't, then it silently ignores your request to use the log. Once I managed to convince it to use the log, it gave me "Mx_ReplaceBytes: bad range." and went into the debugger. Is there any way to apply the log to the file, even if the log was started on a slightly different file? I'd be happy just to get half of my changes back.

1200. Date: Fri, 04 May 90 12:54:13 PDT From: Fred Douglis <douglis> Subject: can't install sun libs from ds3100
it generates a bogus archive member when it does the ranlib.

1201. Date: Fri, 4 May 90 15:07:54 PDT From: mendel (Mendel Rosenblum) Subject: Re: tftpd vs inetd?
The Sprite tftpd is not implemented as a server that can be started by inetd. It should be started in bootcmds and not be in the inetd.conf file.

1202. Date: Fri, 04 May 90 17:17:05 PDT From: Fred Douglis <douglis> Subject: X server went amok
a few minutes ago i killed ann's X server, which had grown to 40 MB and was causing allspice to hang up. this message is intended to serve two purposes: bring up the question of why ann's server keeps going so crazy, and bring up the question of what we can do about allspice hanging up when a process runs amok. i know we've discussed this from time to time but it doesn't seem we've resolved anything except that ken's new "plumbing" will fix it eventually. i still think we should install a stop-gap measure -- either limit process sizes, or catch processes that are thrashing, or something. by the way, mendel pointed out that when i removed the special "put on front of LRU list" code for swap pages, i only caught reads and not writes. the writes are really the problem, since they make writes go straight through to disk.
perhaps once the correct fix makes it to allspice, allspice's cache will be more effective in guarding against huge processes. or maybe it will just take longer to go crazy, in which case we need to catch the processes that are doing it. this fix is not yet in any kernels but is in mendel's copy of the file system.

1203. Date: Fri, 4 May 90 22:36:59 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: migration / file bug
I'm currently running the shell that migrates my jobs. If I migrate a job whose stdout is redirected into a file, then the file will have zero size on my machine, but will have the correct size and contents on the machine that the job was migrated to. This happens on both hijack and parsley.

1204. Date: Fri, 4 May 90 22:39:08 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: another migration bug
If I migrate a job that is writing to a file after it has started running, then the file size does not change on my machine. For example, I migrate the ld of a kernel after it has started. The file size on my machine will be whatever it was when the job was migrated. The other machines see the correct size. This is an old bug that I thought had been fixed.

1205. Date: Fri, 04 May 90 23:56:53 PDT From: Fred Douglis <douglis> Subject: Re: another migration bug
i think this was possibly fixed and then reintroduced. i discovered it a couple of days ago when testing different scenarios for cache flushing during migration. it's because the old attributes for the file wouldn't get invalidated when it got migrated. this is fixed in the latest fsconsist.

1206. Date: Mon, 07 May 90 10:06:15 PDT From: rab (Robert A. Bruce) Subject: dumps
The dumps did not complete last night. When I came in this morning murder was comatose. It won't go into the debugger and it doesn't respond to ping or kmsg.

1207. Date: Mon, 7 May 90 11:46:02 PDT From: mendel (Mendel Rosenblum) Subject: allspice crash at 11:00
Allspice crashed with a cache write-back error.
The memory error register contained: 0xc4 Write back invalid translation. (During a write back bus cycle an invalid translation exception occurred.) Context number 0. The memory error address register contained: 0xfff113f0. This is in the kernel DVMA address space. I was able to continue the machine.

1208. Date: Mon, 07 May 90 14:51:12 PDT From: Fred Douglis <douglis> Subject: major fs screw-ups
i installed a new migd this morning after changing it per fubar's comment. when i rebooted my machine this afternoon it died in bootcmds when migd hit a segmentation violation. turned out migd was only 65K long. i removed it and ran update again from the source area. still hit a segv. this time the file was normal length, but rather than containing the data i just copied, it contained some random garbage, especially mail from ouster & mgbaker's spool files! even more oddly, when we looked at this file a little while later it contained different data. must be getting rewritten even as we speak. the file is /sprite/trashed/migd.

1209. Date: Mon, 7 May 90 22:43:56 PDT From: culler (David Culler) Subject: rlogin to ds3100 broken?
Lately I've found that I cannot rlogin from ernie to cardamom, but I can to Sage. I did this from home and came in later to find cardamom was o.k. I could ping it from ernie even when I could not rlogin. I did not receive an error message or a timeout. Rlogin would just hang.

1210. Date: Mon, 7 May 90 23:58:40 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: allspice
I had to debug and reboot allspice this evening. /sprite/src/kernel had become locked. I wasn't able to figure out why that file was locked, but I did find that there were two processes deadlocked over the file "proc.h". One of them was in Fsconsist_Close and was waiting on the consistency monitor lock.
It had locked the file handle in Fsrmt_RpcClose when it called the client verify routine (I think this is true, but I could be wrong because those function tables always confuse me). The other process was trying to lock the handle in Fsutil_HandleLockHdr, and had grabbed the consistency monitor lock in Fsconsist_GetClientAttrs.

1211. Date: Tue, 8 May 90 11:59:55 PDT From: elm (ethan miller) Subject: small problem in mig
When using the -h option, the -b option doesn't work. This is from my sun4c (terrorism) to any other sun4c (I tried tyranny and sedition). When I specified it w/o the -h option, it worked fine to tyranny. I'd like to use this because, in this particular case, one machine that was a target for migration was about to be rebooted, so I wanted to run the process elsewhere.

1212. Date: 7 May 90 17:39:07 PDT (Monday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Gremlin fonts
When I try to run gremlin here at PARC I get a "Couldn't get font file" error. I don't seem to have the sources (distribution error?), so I can't figure this out right away. Does anyone know where gremlin expects to find its fonts? Under my X11R3/lib/fonts I see
100dpi 75dpi local misc xproof

1213. Date: 8 May 90 12:03:00 PDT (Tuesday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: allspice
First, if proc.h is somewhere under /sprite/src/kernel, then that directory can become locked due to a chain reaction. Directory scanning grabs HandleLocks on the parent and the child as it descends... Now, the deadlock is interesting! Ahh, it is even described in Fsconsist_GetClientAttrs: Client 1 does a get attributes at about the same time Client 2 does a close. Client 2 has its handle locked during the close, but we will be calling back to get its attributes. Our callback (on behalf of Client 1) can't start until Client 2 unlocks its handle. But Client 2 won't unlock its handle until its close finishes. The close can't finish because Client 1 is blocked on the locked handle.
To guard against this, the handle is unlocked inside Fsconsist_GetClientAttrs. The problem is that the handle is re-locked just before exiting, and this is still inside the monitor lock. It's a small window, but it was found. I think this is a general problem with the cache consistency lock. The handle can't be locked during cache consistency because unrelated open, close, read, and write operations need to take the handle lock. I had coded things so that you entered the cache consistency monitor with the handle locked, and those routines released the handle lock during the callbacks and then reacquired it before releasing the monitor lock:

    External routine locks handle
    Enter cache consistency monitor
    release handle lock
    do consistency stuff, including callbacks.
    lock handle
    release monitor lock.

It is the order of the last two steps that caused the deadlock. I think you can argue that this structure only prevents deadlock with operations that are triggered by a callback (which was my initial concern), but it doesn't prevent deadlock with unrelated actions. A better structure, perhaps, would be to never enter the cache consistency monitor with the handle locked. Processes could "slip by" each other at the point of monitor entry, but I still can't think of a problem that would cause. All important state changes about consistency, anyway, are done under the monitor. The handle lock is used during directory scanning and as a default way to synchronize over an object. A handle can't go away until after it has been "released", so unlocking it doesn't pose that problem either. In short, someone at Berkeley should spend a little time familiarizing themselves with the use of the cache consistency routines, and see if they can't find a hole in my argument. The current structure is probably over-conservative. I know that I always went with more locking at first because I couldn't guess at all the potential races. The penalty is deadlock, however.

1214.
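A note on message 1213: the ordering it argues for (never wait for the monitor while holding the handle lock, and release the monitor before re-locking the handle) can be sketched with POSIX mutexes as stand-ins. This is only an illustration under that assumption; `handle_lock`, `monitor_lock`, and `consistency_op` are hypothetical names, not Sprite kernel routines, and the real handle and monitor locks differ in detail.

```c
#include <pthread.h>

/* Hypothetical stand-ins for the per-file handle lock and the cache
 * consistency monitor lock; these are not the Sprite kernel's types. */
static pthread_mutex_t handle_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t monitor_lock = PTHREAD_MUTEX_INITIALIZER;
static int consistency_ops_done = 0;

/*
 * Called with handle_lock held; returns with handle_lock held.
 * The handle is dropped before waiting for the monitor, and the monitor
 * is released BEFORE the handle is re-locked, so the routine never waits
 * for one of the two locks while holding the other.
 */
int consistency_op(void)
{
    pthread_mutex_unlock(&handle_lock);   /* release handle lock */
    pthread_mutex_lock(&monitor_lock);    /* enter consistency monitor */
    consistency_ops_done++;               /* consistency work, callbacks */
    pthread_mutex_unlock(&monitor_lock);  /* release monitor lock first... */
    pthread_mutex_lock(&handle_lock);     /* ...then re-lock the handle */
    return consistency_ops_done;
}
```

Because no thread ever blocks on one lock while holding the other, the two-process cycle reported in message 1210 cannot form in this sketch; the cost, as the message says, is that processes may slip by each other at monitor entry.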
Date: Wed, 9 May 90 10:04:04 PDT From: ouster (John Ousterhout) Subject: Another corrupted file
This happened a few days ago. The file is /sprite/users/eklee/old2/bin/mkcc:
mint: ls -l mkcc
-rwxr-xr-x 1 eklee 1359 Feb 21 11:32 mkcc*
mint: /sprite/admin.sun3/fsindex -dev rxy0 -part g mkcc
mkcc Desc 26481 size 1359 kbytes 2 version 0: 69104 -1 * 2 frag(s) offset 0 mkcc 1 blocks 1 seeks
Fsblockcheck didn't turn up any other (current) uses of the trashed block. The block contained a piece of a mail message to Jim Hunt.

1215. Date: Wed, 9 May 90 13:22:46 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: common lisp, getwd, and unix compatibility
One big obstacle to our quest for binary compatibility is our handling of remote links. Unix binaries do not understand them. In particular, getwd() computes the current directory by working its way up to the root. It does this by 'stat'ing the current directory, then opening the parent directory and looking for an entry with the same device and inode number. This algorithm doesn't work on Sprite when it gets to the top of a domain. It will stat the directory at the top of the domain, then open the parent and see (and ignore) the remote link. I'm not sure how to fix this, but if we truly want to be binary compatible then we had better hide remote links from unix programs. If we want unix programs to have any performance, then we had better fix it so that 'stat'ing a remote link does not always cause a broadcast. Both of these items are old news to some of you, but it seemed appropriate to bring them up again. Dave, this is why common lisp does not run correctly.

1216. Date: Wed, 9 May 90 13:25:26 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: addendum
I might add to that last message that the Sprite version of getwd() uses the prefix table and thereby avoids 'stat'ing everything in the root directory. We may want to retain this functionality.

1217.
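A note on message 1215: the parent-directory scan it attributes to the Unix getwd() can be sketched as below. This is an illustrative reconstruction using POSIX calls, not the actual library source, and the function name `name_in_parent` is made up for the example. Per the report, at the top of a Sprite domain the matching entry in the parent is a remote link whose device and inode numbers do not match, so a loop like this one would fall through and report failure.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * One step of the classic getwd() walk: given the stat of a directory,
 * scan parent_path for the entry with the same device and inode number
 * and copy its name into name[]. Returns 0 on success, -1 if the parent
 * cannot be opened or no entry matches.
 */
int name_in_parent(const char *parent_path, const struct stat *child,
                   char *name, size_t name_len)
{
    DIR *dir = opendir(parent_path);
    struct dirent *ent;
    int found = -1;

    if (dir == NULL)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        char path[4096];
        struct stat st;

        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof path, "%s/%s", parent_path, ent->d_name);
        if (lstat(path, &st) != 0)
            continue;               /* entry vanished or is unreadable */
        if (st.st_dev == child->st_dev && st.st_ino == child->st_ino) {
            snprintf(name, name_len, "%s", ent->d_name);
            found = 0;
            break;
        }
    }
    closedir(dir);
    return found;
}
```

A full getwd() would repeat this step with "..", "../..", and so on until the directory and its parent are the same file, prepending each name found.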
Date: Wed, 9 May 90 13:43:47 PDT From: mendel (Mendel Rosenblum) Subject: Memory usage VM versus file cache
The file system refuses to give blocks to the VM system when all the blocks in the cache are dirty. This means that if you write a large file so that all the blocks in the cache are dirty and then start a large process, the process will thrash. If you continue to write the file, the cache will stay dirty and the process will continue to thrash. The problem is even worse on machines with a page size greater than 4K. In these systems, all the blocks on a page must be clean before the page can be handed from the file system to the VM system. This means that the problem can happen well before the entire cache becomes dirty.

1218. Date: Thu, 10 May 90 08:52:50 PDT From: ouster (John Ousterhout) Subject: More file corruptions
The checksummer "at" script somehow got lost, so it didn't run for a few days. I restarted it yesterday, and last night it found two more errors. Both these files seem moderately important; Bob, can you restore them from dump tape?

1. /sprite/lib/ditroff/ds3100.md/devsun/nb.orig was corrupted with what seems to be information from the migration daemon (?). Here's a sample of what was at the tail end of the file:
ContactGlobal - completed successfully
Error 32 writing to global daemon: broken pipe.
ContactGlobal - Thu May 3 22:10:41 1990
ContactGlobal - completed successfully
Error 32 writing to global daemon: broken pipe.
ContactGlobal - Fri May 4 08:01:41 1990
Here's fsindex and fsblockcheck information for the file:
mint-7# fsindex -dev rxy0 -part g nb.orig
nb.orig Desc 30250 size 1528 kbytes 2 version 0: 322676 -1 * 2 frag(s) offset 0 nb.orig 1 blocks 1 seeks
mint-8# fsblockcheck -D rxy0 -P g 322676
Checking block 322676
Block 322676: Start-frag=0 Num-frags=2 FD=30250
Block 322677: Start-frag=1 Num-frags=2 FD=40603
Note that this time the block is actually present in two files at the same time. Note that the blocks only OVERLAP: they don't coincide.
I also found the other file: it's /sprite/admin/migd/raid1.Berkeley.EDU.log. Here's fsindex information for it: mint-20# fsindex -dev rxy0 -part g raid1* raid1.Berkeley.EDU.log Desc 40603 size 13469 kbytes 14 version 0: 166932 -1 * 1: 190660 23728 * 2: 190460 -200 * 3: 322677 132217 * 2 frag(s) offset 1 raid1.Berkeley.EDU.log 4 blocks 4 seeks In order to clean up the free list I copied the migd log onto itself and deleted the nb.orig file. 2. /sprite/cmds.sun3/Pnews has binary-looking junk at the end of an otherwise textual file. I can't make any sense of the binary junk. Here's what fsindex and fsblockcheck have to say: mint-31# fsindex -dev rxy0 -part g Pnews Pnews Desc 84816 size 9786 kbytes 10 version 0: 75932 -1 * 1: 75960 28 2: 222640 146680 * 2 frag(s) offset 0 Pnews 3 blocks 2 seeks mint-32# fsblockcheck -D rxy0 -P g 222640 Checking block 222640 Block 222641: Start-frag=1 Num-frags=2 FD=57131 Block 222640: Start-frag=0 Num-frags=2 FD=84816 Hmmm, this one's in two files too, with the same kind of overlap. In this case the other file is /sprite/lib/fonts/pk/cmmib10.58pk. But this is very weird: BOTH FILES ARE OLD!!! The fonts file dates from 1987 and the Pnews file from 1989. Here's the fsindex information: mint-36# fsindex -dev rxy0 -part g cmmib10.58pk cmmib10.58pk Desc 57131 size 1840 kbytes 2 version 0: 222641 -1 * 2 frag(s) offset 1 cmmib10.58pk 1 blocks 1 seeks This corruption seems to contradict the theory that the problem is in the block allocator. I'm now beginning to wonder if perhaps file descriptors are getting corrupted in memory and then written back to disk. 1219. Date: Thu, 10 May 90 21:35:41 PDT From: Fred Douglis <douglis> Subject: rpc forgot reply i saw a flurry of messages like the following: RpcResend: RPC 23, client 20, RPC seq # 212e84, forgot reply? can anyone tell me what this means and whether i should pay attention to it? i presume it happened during a migration to parsley (host 20). 1220. 
Date: 11 May 90 13:54:48 PDT (Friday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: rpc forgot reply The comments in RpcResend aren't that great. The "forgot reply" message really means that a client requested a re-send of a reply, but the server process had started on a different request. This could happen if the server somehow couldn't get through to the client in order to explicitly close a channel, in which case it gives up and closes the channel anyway. After that, the client may resend and catch the server in this funny state where it doesn't have a reply to resend. 1221. Date: Fri, 11 May 90 14:52:56 PDT From: Fred Douglis <douglis> Subject: nfsmount disappears since /miro/* has been set up, seth has complained 2-3 times that the nfsmount daemons have disappeared from bribery. gone without a trace... 1222. Date: Fri, 11 May 90 17:04:23 PDT From: shirriff (Ken Shirriff) Subject: crash in VmListInsert Mustard crashed in VmListInsert, running the "forstall" kernel. The problem seemed to be that a process was migrating and a page was being freed, but when the page was added to the free list, VmListInsert complained "Inserting element twice". Since this wasn't a normal Sprite kernel, this can probably be classified as a random crash. 1223. Date: Fri, 11 May 90 17:51:44 PDT From: Fred Douglis <douglis> Subject: mint recovery hiccup mint was in chaos for about 10-15 minutes late this afternoon after mustard rebooted. things started timing out. then, for some reason mint was spending so much time printing "dropping regular open during recovery" that it wasn't making much progress on recovery. It finally cleared up; we're not sure if doing "L1-N" to reset its net interface had anything to do with this or not. by the way, "/" is filling up due to the huge syslogs people are writing to /hosts/%HOST/syslog.out. i propose that we use /sprite/syslogs/%HOST until we move to /newroot. 1224.
Date: Sat, 12 May 90 09:39:15 PDT From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: mint locked up tons of ready processes. the one running process was not at a reasonable point --- either the sources were different or the stack was fouled up or something. the ready queue was fine but Sched_ContextSwitch was never being called. I'm dialed in so I finally just rebooted. Someone should try to check mint's console log to see if there's any clue there.... 1225. Date: Sun, 13 May 90 20:34:31 PDT From: Fred Douglis <douglis> Subject: ds3100 vm panic: bad list kvetching died during a page-in with a bad list -- seems TLBHashInsert was inserting with pid 0 which had a "pidListElems[pid].tlbList" of NIL. 1226. Date: Mon, 14 May 90 09:36:15 PDT From: pmchen (Peter M. Chen) Subject: mail not delivered from non-sprite to sprite Mail from the non-sprite world to sprite is not getting through (or it's taking real long, like days). Mail from sprite to sprite is fine. Mail from sprite to the outside works fine. 1227. Date: Mon, 14 May 90 09:42:21 PDT From: Fred Douglis <douglis> Subject: Re: mail not delivered from non-sprite to sprite i tried "telnet allspice 25" to see if the sendmail daemon was responding. it wasn't. i was able to rlogin and i saw that the sendmail daemon was around. i think there's some sort of bug in the ipServer that causes it to lock up connections after a while (i doubt the bug is in sendmail, which is straight from BSD). the same thing happens to inetd, especially on allspice. i restarted the sendmail daemon. thanks for pointing this out. every so often when i don't get mail for a while i check on allspice's daemon, but i thought i'd checked recently. guess not. 1228. Date: Mon, 14 May 90 12:08:21 PDT From: Fred Douglis <douglis> Subject: leak in pdev i found out that mendel's bug report about migd growing so large was from a core leak in pdev.c. it allocates read & ioctl buffers on a per-stream basis but never frees them! 
(if they are reallocated they're freed, but the last allocation sticks around.) i wonder if this could account for any remaining core leaks in the ipServer? it certainly accounts for migd, since each request for load averages would pitch 2K. one thing, though. remember when allspice crashed when you were trying to check in pdev.c? well, i tried checking it out but the file was "busy". so i removed the lock file in RCS and tried again. it didn't tell me that the file had been modified or anything. my question is, did the file become read-only to you before it was actually checked in? in that case, any changes you made may have been lost when i checked out the version RCS knew about. 1229. Date: Mon, 14 May 90 12:48:41 PDT From: elm (ethan miller) Subject: terrorism (sun4c) crash This crash occurred while running Bruce Forstall's kernel, so it may not be a true sprite bug, but.... Fatal Error: MachPageFault kernel page fault at illegal PC 0xf605ef40 addr 0xdaf29814 Entering debugger with an Interrupt Trap (16) exception at PC 0xf608f0c8 I found the machine in this state when I got back on Monday morning, and I rebooted it. 1230. Date: Mon, 14 May 90 18:12:08 PDT From: Fred Douglis <douglis> Subject: another major fs f**kup: /tmp this time remember my mail about a binary in /sprite/daemons.ds3100 having random junk in it, and changing even as i watched? looks like /tmp files are getting trashed similarly. ccom is being directed at a file that contains a combination of my pmake output, mary's kernel install (i assume), and binary data. --- ds3100.md/migd.o --- ccom: Error: , line 1: syntax error rm -f ds3100.md/Mig_GetPdevName.o ---^ ccom: Error: , line 123: a / was found, but '.' expected; an ellipsis was inserted llib-lfsutil:../fs/fsStat.h ---------------^ ccom: Error: , line 262: a / was found, but '.' 
expected; an ellipsis was inserted llib-lfsutil:../fs/fsStat.h ---------------^ [20 similar lines deleted] ccom: Error: , line 5302: illegal character: 221 (octal) * : --^ (ccom): , line 5302: ccom: Internal: too many errors * : --^ *** Error code 1 `install' not remade because of errors. Compilation finished at Mon May 14 18:09:19 1231. Date: Mon, 14 May 90 18:13:47 PDT From: Fred Douglis <douglis> Subject: addendum the problem of ccom being pointed at garbage happened twice over a period of about 5 minutes. each time, restarting pmake worked fine, or at least it apparently worked fine. (that's what scares me. the last time, i installed a binary and didn't know it was copied into a garbage "file from hell" until i couldn't boot a machine!) 1232. Date: Tue, 15 May 90 10:20:48 PDT From: Fred Douglis <douglis> Subject: tx panic tx panicked when it couldn't grab the pointer after i accidentally hit a menu just as i was lowering the window. this is a fatal error, but it seems like it doesn't have to be fatal. 1233. Date: Tue, 15 May 90 15:21:02 -0700 From: pmchen@sprite.Berkeley.EDU (Peter M. Chen) Subject: gremlin font sizes The default font sizes that come up on the screen (gremlin) are significantly smaller than the default font sizes that come up using grn. I'd guess they're about 4 points smaller on the screen. Can you look into this? It's been the case for about half a year, but before that it was fine. 1234. Date: Tue, 15 May 90 15:43:38 PDT From: tve (Thorsten von Eicken) Subject: Re: gremlin font sizes The fonts may have changed when we started switching away from X11R2. Please try gremlin now with an X11R4 server (and possibly gremlin compiled with the X11R4 library), if you haven't done so already. If that fixes the problem, great; if not, I tend to feel that gremlin is broken but that I could assist you in tracking down the problem. It would certainly help if you could do some cross-platform tests, like gremlin on sprite and X on sunOs, or the reverse. 
1235. Date: Tue, 15 May 90 17:13:07 PDT From: mgbaker (Mary Gray Baker) Subject: timer queue problems fixed... If you ever have a problem with a timer queue being messed up, dying in List_Insert() or List_Remove(), look for changes in size in, for instance, the fs_Stats structure. One of the common timer queue elements is statically allocated right after the fs_Stats structure. I'm mentioning this because I've had this problem before with the timer queue, and because Fred wanted me to mention it so it would be logged. 1236. Date: Wed, 16 May 90 13:54:38 PDT From: mgbaker (Mary Gray Baker) Subject: rosemary strings and kernel install The strings program on rosemary, last updated April 24th, no longer recognizes the VERSION string in the sun4, sun4c or sun3 kernels. (It recognizes it in the decstation kernel.) This means that the "Save" done as part of the install only moves the kernel to a name such as "sun4.". 1237. Date: Wed, 16 May 90 15:16:32 PDT From: ouster (John Ousterhout) Subject: More file corruptions in /sprite 1. /sprite/lib/fonts/pk/icmex10.69pk was corrupted with part of a mail message. mint-2# ls -l icmex10.69pk -rw-r--r-- 1 root 1464 Oct 14 1987 icmex10.69pk mint-3# fsindex -dev rxy0 -part g icmex10.69pk icmex10.69pk Desc 57862 size 1464 kbytes 2 version 0: 7996 -1 * 2 frag(s) offset 0 icmex10.69pk 1 blocks 1 seeks 2. /sprite/lib/fonts/pk/ilcmssi8.746pk was corrupted with what appears to be migd log messages. mint-4# ls -l ilcmssi8.746pk -rw-rw-r-- 1 root 1464 Oct 25 1987 ilcmssi8.746pk mint-5# fsindex -dev rxy0 -part g ilcmssi8.746pk ilcmssi8.746pk Desc 58529 size 1464 kbytes 2 version 0: 172176 -1 * 2 frag(s) offset 0 ilcmssi8.746pk 1 blocks 1 seeks 3. /sprite/doc/ref.ancient/cmds/.proto was corrupted with what appears to be RCS information from biglibtop.mk,v (from /sprite/lib/pmake?) 
mint-6# ls -l .proto -rw-rw-r-- 1 deboor 668 Oct 19 1987 .proto mint-8# fsindex -dev rxy0 -part g .proto .proto Desc 24976 size 668 kbytes 1 version 0: 1238 -1 * 1 frag(s) offset 2 .proto 1 blocks 1 seeks I need to check some of the other bug reports to be sure, but it seems to me that it's always fragment 1 that ends up being shared between two files. 1238. Date: Wed, 16 May 90 17:13:48 PDT From: ouster (John Ousterhout) Subject: Migration database in distribution When I ran "finger" on the Sprite machine at CMU (booted from the distribution tape), it died with the error "could not access migration database". I believe that this is because a file is missing? I also think this bug was present in earlier distributions and was reported. Bob, can you fix this in the distribution, and also add a note to your list of tests to run when you're testing new distributions to test finger and rup? 1239. Date: Thu, 17 May 90 09:44:49 PDT From: Fred Douglis <douglis> Subject: distribution migd problem I was able to get things working at CMU: [testarossa.mach.cs.cmu.edu] Login Name Idle When Where ouster John Ousterhout 5 Tue 08:25 testarossa (tyranny.Berkel) root The Sprite God 2 Mon 13:58 testarossa The problems were: 1) bootcmds started up loadavg rather than migd. 2) migd was not setuid to root (and there was no /sprite/src/daemons/migd). another separate problem is: 3) testarossa's date is 9:39 AM PDT on some date in 1972. Let's hear it for bootstrapping. By the way, I *think* the other programs are set up to use migd rather than the old shared ascii file.... a quick pmake test said no hosts were available, but i guess that would happen if the shared file was just empty, so that doesn't say much. finger uses migd. since they only have one host in spritehosts, it probably doesn't much matter... 1240. Date: Thu, 17 May 90 14:22:26 PDT From: Fred Douglis <douglis> Subject: nuke the decwriter!!!
mint had another recovery storm when it rebooted after a cache writeback problem (due to running an old kernel without the bug fix). i finally tried typing "cat /hosts/mint/dev/syslog" from my machine. after about 5 minutes of recovery storm it snuck in there and started catting the file, at which point the storm ended. when i continued jaywalk, larceny, and treason, that window printed about 30 lines of messages in rapid succession. if that had been going to mint's console everyone would have started timing out again. 1241. Date: Thu, 17 May 90 14:23:57 PDT From: Fred Douglis <douglis> Subject: mail trashed (real message) how appropriate that my mail about mail getting trashed came out twice as an empty message! it was really just saying that bob miller sent mail before about printer trouble, and i, for one, got only a garbage message. (i got a real copy of it on unix, which is how i know what it was). 1242. Date: Thu, 17 May 90 14:30:12 PDT From: root (The Sprite God) Subject: runaway csh after boot there was a csh -i process in a loop, with a parent process of initSprite. i killed it. i don't know if this was because bootstrap failed or it was something else. 1243. Date: Fri, 18 May 90 09:05:01 PDT From: ouster (John Ousterhout) Subject: Another corrupted file The good news is that so far I've never found a corrupted file except on /sprite. Today's victim is /sprite/lib/mkmf/RCS/Makefile.hdrs,v. Bob, can you restore this file from dump tape? It was the unwilling recipient of a piece of an outgoing mail message. The corrupted fragment was once again fragment 1 out of a 4-fragment block. Here's the relevant information: mint-1# ls -l Makefile.hdrs,v -r--r--r-- 1 rab 1666 Oct 9 1989 Makefile.hdrs,v mint-2# fsindex -dev rxy0 -part g Makefile.hdrs,v Makefile.hdrs,v Desc 6344 size 1666 kbytes 2 version 0: 336732 -1 * 2 frag(s) offset 0 Makefile.hdrs,v 1 blocks 1 seeks 1244.
Date: Fri, 18 May 90 14:33:13 PDT From: mendel (Mendel Rosenblum) Subject: mkmf bug with .o files When you type "mkmf" in a directory of type "bigcmd" (/sprite/src/attcmds/rpn for example) you get an error message of: WARNING: file ds3100.md/linked.o does not have a source file 1245. Date: Fri, 18 May 90 15:42:34 PDT From: Fred Douglis <douglis> Subject: xman broken under R4 the R3 xman is broken with the R4 server (i get lots of "parameter not a window" messages). i presume it won't compile under R4 without changes. is there an xman distributed with R4, or does anyone know what must change? 1246. Date: Fri, 18 May 90 16:28:55 PDT From: elm (ethan miller) Subject: tx compatibility with "normal" terminal Would it be possible to include a mode in tx which makes it compatible with a normal terminal, such as an xterm or vt100? The reason I ask is that some machines I rlogin into (in particular, Crays) don't have any way of using a terminal type not built in, so I can't use tx directly and have to use an xterm running on a SunOS machine. Another (very minor) problem--can the cross-hatching on the bottom of an empty tx window be changed to the solid background color? 1247. Date: Sat, 19 May 90 16:34:31 PDT From: mendel (Mendel Rosenblum) Subject: more fs locking problems Oregano hung up with /tmp locked. The problem appears to be similar to the consist deadlock we had before. A file in /tmp was being deleted and the RpcServer was hung trying to relock the handle after the consist callbacks in Fsconsist_ClientRemoveCallback(). It differed from the previous deadlock because the handle was locked by a Proc_ServerProc that was writing back a swap file totally unrelated to the file being deleted. It appears there must be some Proc_CallFunc'ed routine that locks a handle but doesn't unlock it before returning. It's a scary thought that /tmp remains locked for the callbacks when deleting a file in /tmp with dirty blocks in a client cache. 1248.
Date: Sun, 20 May 90 13:35:01 PDT From: mendel (Mendel Rosenblum) Subject: mint crash 5/20 I had to reboot mint today because it hung and would not respond to the console. It appeared to have been too involved in recovery with several clients at the time. From our experience with the last couple of reboots of mint it is clear that mint will not recover without some help. I had to put the machines in our office into the debugger and cat mint's syslog to /dev/null to get the recovery storm to end. It also indicates that recovery storms are not dependent on crashes during high activity. The entire system was basically idle when mint crashed. 1249. Date: Thu, 24 May 90 11:29:24 PDT From: Fred Douglis <douglis> Subject: ds3100 X11R4 server problem if i open a window from a remote host, and that host crashes or reboots when i have the window iconified, then when i try to open the window my server freezes and must be restarted. this worked fine under R3. 1250. Date: Fri, 25 May 90 16:42:12 PDT From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: oregano dreadlock oregano died with the same bug mendel found recently: on kvetching, i hit a "consist done" and "remove" getting hung, and then everyone started backing up on /c. the backtrace of who had locked what ended with a Proc_CallFunc having locked something but never unlocked it before finishing whatever it was doing. 1251. Date: Sun, 27 May 90 05:19:18 PDT From: rab (Robert A. Bruce) Subject: migd Migd was writing about 50 messages per second to hijack's console complaining about an undefined ioctl. I put hijack into the debugger. /hosts/hijack/syslog.out was more than 15 Meg. I deleted it because / was full. 1252. Date: Mon, 28 May 90 17:20:07 PDT From: douglis@dill (+) Subject: ds3100 R4 server starts up wrong it behaves as though F1 were pressed. my first keystroke was a D, and the machine went into the debugger. 1253.
Date: Mon, 28 May 90 17:40:17 PDT From: douglis@dill (+) Subject: correction: R4 server when restarted the problem with the console being in F1-mode only occurs if i have to kill the server with F1-K and then i restart -- but it isn't just that F1 is still in place, since i can type commands and the F1 doesn't reappear until I start X. this may not be an R4 bug but just a kernel bug that this sequence tweaked... 1254. Date: Tue, 29 May 90 11:30:19 PDT From: tve (Thorsten von Eicken) Subject: TLB ST miss exception on gluttony (ds3100) Gluttony seems to die nightly with this error at PC 0x800bf8f8. Is that a known problem? Is it worth debugging the next time 'round? 1255. Date: Tue, 29 May 90 16:16:19 PDT From: sequent!fubar@uunet.uu.net Subject: Fsmake computes bogus free space In SetDomainParts, fsmake.c computes "numBlocks" (free space left in the partition) based on the number of cylinders on the entire disk. This is wrong; it should use the number of cylinders in the partition being built. I noticed this when fsmake attempted to put 209000 file descriptors on an 8 meg partition of a 750 meg disk. After this fix, it created 8600 descriptors, a much more reasonable value. 1256. Date: Wed, 30 May 90 10:10:00 PDT From: mendel (Mendel Rosenblum) Subject: Re: Allspice crashed > Allspice crashed last night. Fatal Error: Pmeg lists empty This means it ran out of pmegs. My guess is it was caused by the memory leaks we have in the Sprite kernel. When I did a vmstat on allspice yesterday morning it looked like: MEMORY STATS: Page Size: 8192 Memory Size: 131048 Kernel Memory: 29752 (Code+Data=28608 Stacks=1120 Reserved=24) User Memory: 5008 (Dirty=1912 Clean=3096) FS Memory: 80000 (Min=0 Max=80000) Free Memory: 16248 This means that at least 80000 + 29752 = 107 megabytes of pmegs were in use by the kernel and file cache. There are only 128 megabytes of pmegs available.
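The arithmetic behind the 107 megabyte figure can be spelled out as a small Python sketch. It assumes that vmstat reports these sizes in kilobytes (1024 bytes) and that a megabyte is 1024 kilobytes; the variable names are invented for the example.

```python
KBYTES_PER_MBYTE = 1024


def kbytes_to_mbytes(kbytes):
    """Convert a size in kilobytes to megabytes (assuming 1024 KB per MB)."""
    return kbytes / KBYTES_PER_MBYTE


# Figures copied from the vmstat output on allspice (assumed to be in KB).
kernel_kb = 29752
fs_kb = 80000
memory_kb = 131048

in_use_kb = kernel_kb + fs_kb
print(in_use_kb, round(kbytes_to_mbytes(in_use_kb), 1))   # 109752 107.2
print(round(kbytes_to_mbytes(memory_kb), 1))              # 128.0
```

Under those assumptions the kernel and file cache together account for 109752 kilobytes, or a little over 107 megabytes of the roughly 128 megabytes reported.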
Here's a rule of thumb for Sprite kernel memory usage: For a machine with N megabytes of memory, the Sprite kernel will expand to N/3 megabytes. oregano up 4 days: Memory size: 16 Meg, Sprite Kernel: 7.2 Meg jaywalk up 7 days: Memory size: 28 Meg, Sprite Kernel: 9.0 Meg allspice up 5 days: Memory size: 128 Meg, Sprite Kernel: 30.0 Meg terrorism up 14 days: Memory size: 28 Meg, Sprite Kernel: 8.7 Meg tyranny up 14 days: Memory size: 26 Meg, Sprite Kernel: 8.1 Meg 1257. Date: Wed, 30 May 90 10:27:37 PDT From: mendel (Mendel Rosenblum) Subject: sparcStation memory loss There is some weirdness in the memory size as reported by the vmstat command on the sun4c. Jaywalk, treason, terrorism, sabotage, and espionage all report 28620 kilobytes of memory. This is 52 kilobytes less than 28 megabytes, the amount of memory in the machine. 52K can probably be explained by memory "stolen" by the PROM. Sage reports 28452 kilobytes; 220 kilobytes less than 28 megabytes. Larceny and tyranny report 26520 and 26184 kilobytes. This is over 2 megabytes less than 28 megabytes. 1258. Date: Wed, 30 May 90 13:36:31 PDT From: tve (Thorsten von Eicken) Subject: gluttony dies again Bad kernel TLB fault at 0x62f0 procptr=0xffffffff TLB ST miss exception at PC 0x800bf8f8 Nobody answered yesterday. Does this look like a software or a hardware failure? If it dies again tomorrow, can I ask someone to debug it? 1259. Date: Wed, 30 May 90 13:41:06 PDT From: shirriff (Ken Shirriff) Subject: random rlogin error I did "rlogin murder" and got "murder.Berkeley.EDU: unknown error (0)". I immediately tried again and it worked. 1260. Date: Wed, 30 May 90 22:55:20 PDT From: tve (Thorsten von Eicken) Subject: gluttony dead AGAIN, can someone please debug!? I need help! Gluttony dies daily. It's dead right now. I bet (can't get to the office) that it's again the same error. Can someone PLEASE spend the time debugging the machine to determine whether it's a kernel bug or a hardware failure? 
The trap almost certainly is: Bad kernel TLB fault at 0x62f0 Entering debugger with a TLB ST miss exception at PC 0x800bf8f8 1261. Date: Wed, 30 May 1990 23:50:41 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: Re: gluttony dead AGAIN, can someone please debug!? I debugged gluttony for a while and wasn't able to determine anything conclusive. It kind of looks like either a hardware problem, or we are taking an interrupt when we shouldn't be. In the following sequence of code the register %r25 should contain the pointer 0x80105630. Instead it contains 0x563c. It almost looks like the first lui got lost, but not quite (where did the extra 0xc come from?). [Timer_TimerServiceInterrupt:319, 0x800bf8e0] lui r25,0x8010 [Timer_TimerServiceInterrupt:318, 0x800bf8e4] lbu r3,0(r14) [Timer_TimerServiceInterrupt:319, 0x800bf8e8] addiu r25,r25,22064 [Timer_TimerServiceInterrupt:319, 0x800bf8ec] sll r24,r15,2 [Timer_TimerServiceInterrupt:319, 0x800bf8f0] addu r4,r24,r25 [Timer_TimerServiceInterrupt:320, 0x800bf8f4] andi r8,r3,0x40 >*[Timer_TimerServiceInterrupt:319, 0x800bf8f8] sw r0,0(r4) 1262. Date: Thu, 31 May 90 15:12:40 PDT From: Fred Douglis <douglis> Subject: tx properties bug it seems that tx doesn't obey the (new?) standard X convention that puts properties on windows regarding whether they're iconified. if you have a bunch of iconified windows, and restart twm (and have "RestartPreviousState" set in .twmrc), then most applications will get reiconified automatically. tx doesn't. 1263. Date: Thu, 31 May 90 16:06:17 PDT From: Fred Douglis <douglis> Subject: sun4c pmeg thrashing bug larceny went into absolute slo-mo this afternoon after oregano acted up. pmegs getting stolen constantly despite the fact that the fs cache size was only 2MB. shrinking the cache further (fscmd -M 50) sped things up again, but in the meantime it was totally unusable. the ipServer is about 4MB for some reason, and it and Xmfb (1.7MB) and emacs (1.3MB) were all constantly ready.
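As a cross-check on the gluttony trap in report 1261, the address arithmetic of the quoted MIPS sequence can be replayed in a few lines of Python. This is only a sketch: it models lui and addiu as 32-bit operations with a sign-extended 16-bit immediate, and it assumes that the reported fault address 0x62f0 is the value that ended up in r4. The helper name lui_addiu is made up for the example.

```python
def lui_addiu(upper, lower):
    """Model `lui rd, upper` followed by `addiu rd, rd, lower` on 32 bits."""
    imm = lower & 0xFFFF
    if imm & 0x8000:          # addiu sign-extends its 16-bit immediate
        imm -= 0x10000
    return ((upper << 16) + imm) & 0xFFFFFFFF


expected_r25 = lui_addiu(0x8010, 22064)   # what r25 should hold
observed_r25 = 0x563c                     # what the debugger showed
fault_addr = 0x62f0                       # assumed to be the value of r4

# r4 = r24 + r25 and r24 = r15 << 2, so the difference is the scaled index.
scaled_index = fault_addr - observed_r25
intended_addr = expected_r25 + scaled_index

print(hex(expected_r25))    # 0x80105630
print(hex(scaled_index))    # 0xcb4
print(hex(intended_addr))   # 0x801062e4
```

Under these assumptions the numbers hang together: the lui/addiu pair does yield 0x80105630, and 0x62f0 minus the observed 0x563c is 0xcb4, a multiple of 4 as the sll by 2 requires. That supports the reading that the store faulted because r25 was wrong, but it does not explain where the extra 0xc came from.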
1264. Date: Fri, 01 Jun 90 10:46:48 PDT From: Fred Douglis <douglis> Subject: kvetching died running vov (different error) It died with a bus error in Fsutil_WaitListInsert. The running process was the newtee process that reads /dev/syslog. the backtrace was: 0 Fsutil_WaitListInsert(list = 0xc02c5ad8, waitPtr = 0xc1247fa8) ["fsNotify.c":65, 0x80085474] > 1 Fsio_DeviceRead(streamPtr = 0xc1247fb4, readPtr = 0xc1247fb8, remoteWaitPtr = 0xc1247fa8, replyPtr = 0xc1247f74) ["fsDevice.c":758, 0x8006a234] 2 .block154 ["fsStreamOps.c":112, 0x80054b30] 3 Fs_Read(streamPtr = 0x80032e80, buffer = 0x10003460, offset = 20229, lenPtr = 0xc1247fec) ["fsStreamOps.c":112, 0x80054b30] 4 Fs_ReadStub(streamID = 4, amountRead = 0, buffer = 0x10003460, amountReadPtr = 0x7ddff8ec) ["fsSysCall.c":319, 0x80055e0c] 5 MachSysCall(0x0, 0x10003460, 0x7ddff8ec, 0x4004a8, 0xfc0c) ["ds3100.md/machAsm.s":1501, 0x800335a8] the point in Fsio_DeviceRead that called Fsutil_WaitListInsert passed it a value that was different inside Fsutil_WaitListInsert. nothing inside that routine is supposed to change this variable. interestingly, the value it got changed to isn't totally off the wall -- it changed from 0xc0198870 to 0xc02c5ad8. i think what finally happened is that when it ran down the bogus list it tried to indirect through a pointer into the code segment. 1265. Date: Thu, 31 May 90 23:04:36 PDT From: sequent!fubar@uunet.uu.net Subject: Re: dev_t == short? >the 4.3 sources here make dev_t a short. (i just checked >on okeeffe and monet.) i'm not sure how to resolve the >discrepancy.... we should probably just discuss this at monday's >meeting. You're right; I was confused. The 4.3 sources do indeed make dev_t a short (include/sys/ is a symlink to /sys/h/, which puts me back in the Dynix source). Under Dynix, though, dev_t is a long. 
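The practical difference between a short and a long dev_t can be illustrated with a rough Python sketch. It assumes the traditional 4.3BSD layout, in which a 16-bit dev_t holds an 8-bit major number in the high byte and an 8-bit minor (unit) number in the low byte; how Sprite itself packs the two is not stated here, and the function names are invented for the example.

```python
def makedev16(major, minor):
    """Pack major/minor as ((major << 8) | minor), truncated to 16 bits."""
    return ((major << 8) | minor) & 0xFFFF


def major16(dev):
    """Extract the 8-bit major number from a 16-bit device value."""
    return (dev >> 8) & 0xFF


def minor16(dev):
    """Extract the 8-bit minor (unit) number from a 16-bit device value."""
    return dev & 0xFF


def round_trips(major, minor):
    """True if the pair survives packing into 16 bits and unpacking again."""
    dev = makedev16(major, minor)
    return (major16(dev), minor16(dev)) == (major, minor)


print(round_trips(2, 255))   # True: the unit number fits in 8 bits
print(round_trips(2, 300))   # False: the unit number needs more than 8 bits
```

In this model any unit number above 255 cannot be recovered from a 16-bit value, and its high bits even spill into the major number.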
Personally, I think making it a long in Sprite (and fixing up stat(2) & whatever else) would save confusion; I've been bitten by the disappearing 8 bits of unit number several times, most recently today: tar will (silently, unless you use -v) not copy /dev correctly, since /dev/ether* have unit numbers greater than 255. 1266. Date: Fri, 1 Jun 90 11:02:35 PDT From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: oregano deadlock yet again the same old thing: #3 0xe063f44 in Sync_SlowWait (...) (...) #4 0xe03aa8a in Fsutil_HandleLockHdr (...) (...) #5 0xe025976 in Fsconsist_ClientRemoveCallback (...) (...) #6 0xe028c5e in Fsio_FileCloseInt (...) (...) #7 0xe02e6fe in DeleteFileName (...) (...) #8 0xe02d6b6 in FslclLookup (...) (...) #9 0xe02cd3c in FslclRemove (...) (...) #10 0xe038892 in Fsrmt_RpcRemove (...) (...) #11 0xe05ee0c in Rpc_Server (...) (...) 1267. Date: Fri, 01 Jun 90 11:30:17 PDT From: sequent!fubar@uunet.UU.NET Subject: Fscheck fails to fix errors In the following, rzd01b is a 30meg partition on a swallow 5, that has just been fsmaked upon. I've seen similar occurrences in the past, with fscheck repeatedly complaining about "block count for file xxx wrong; is yy should be zz" over and over on the same partition. Booting single user and "mv xxx tmp; cp tmp xxx" fixes that one. I'm not sure about this one. What units are the "Block 327xx" reported in? If they are filesystem blocks (4K), then they are much too large for the filesystem, which has only 8190 data blocks. elm309 28 # fsmake -dev rzd01 -part b -initialPart c -write Making filesystem for local host, ID = 0x4 MakeFilesystem based on 4K filesystem blocks You are about to overwrite the "(new domain)" filesystem. Do you really want to do this?[y/n] y Disk has 27 tracks/cyl, 81 sectors/track 273 4K Blocks fit on a cylinder with 3 512 byte sectors wasted Reserving 273 blocks for domain header, etc.
Domain Header <f8e7d6c5> First Cyl 355, num Cyls 32, raw size 34992 kbytes offset blocks FD Bitmap 273 1 File Desc 274 271 8672 Bitmap 545 1 Data Blocks 546 8190 Geometry sectorsPerTrack 81, numHeads 27 blocksPerRotSet 0, tracksPerRotSet 0 rotSetsPerCyl -1, blocksPerCylinder 273 Offset (Sorted) >> 8672 files, 32760 kbytes "(new domain)" (-1) 32748 Kbytes free, 8668 file descriptors free Attach seconds: 18365 Detach seconds: 18418 elm309 29 # fscheck -bitmapVerbose -dev rzd01 -part b -initialPart c -write -verbose rzd01b: ***** Fscheck ***** rzd01b: Fri Jun 1 11:20:24 1990 rzd01b: Performing recovery check rzd01b: Summary Sector Info: rzd01b: (new domain) domain -1 safe rzd01b: Attach/Detach fields not valid. rzd01b: Fscheck has fixed disk 0 times already. rzd01b: Checking file descriptors: rzd01b: Traversing directory tree: rzd01b: Comparing old and new data block bit maps: rzd01b: Block 32768: old 0 new 4. rzd01b: Block 32776: old 1 new c. rzd01b: Block 32780: old 0 new b. rzd01b: Block 32832: old f new 0. rzd01b: Block 32836: old f new 0. rzd01b: Block 32840: old f new 0. rzd01b: Found error in data block bitmap rzd01b: 3 files, 12 blocks in use, 32748 blocks free, 0 fragments rzd01b: Writing disk elm309 30 # !! fscheck -bitmapVerbose -dev rzd01 -part b -initialPart c -write -verbose rzd01b: ***** Fscheck ***** rzd01b: Fri Jun 1 11:21:03 1990 rzd01b: Performing recovery check rzd01b: Summary Sector Info: rzd01b: (new domain) domain -1 safe rzd01b: Attach/Detach fields not valid. rzd01b: Fscheck has fixed disk 1 times already. rzd01b: Checking file descriptors: rzd01b: Traversing directory tree: rzd01b: Comparing old and new data block bit maps: rzd01b: Block 32768: old 0 new 4. rzd01b: Block 32776: old 1 new c. rzd01b: Block 32780: old 0 new b. rzd01b: Block 32832: old f new 0. rzd01b: Block 32836: old f new 0. rzd01b: Block 32840: old f new 0. 
rzd01b: Found error in data block bitmap rzd01b: 3 files, 12 blocks in use, 32748 blocks free, 0 fragments rzd01b: Writing disk elm309 31 # 1268. Date: Fri, 01 Jun 90 17:07:01 PDT From: Fred Douglis <douglis> Subject: migration deadlock espionage died with a migration-related deadlock. While one rpc server had the sig monitor locked and was trying to lock a process, another server had the process locked and was trying to get the sig monitor. oops. clearly, brent isn't the only one who succumbed to the "unlock and then lock object" syndrome. i'll try to fix this at my first opportunity. 1269. Date: Mon, 4 Jun 90 00:25:08 PDT From: mgbaker (Mary Gray Baker) Subject: mint crash during shutdown Mint crashed tonight when I tried to shut it down. After printing out "catnip (48) RPC timed out" a few times, and then "Waiting with 4 kernel processes still alive." it printed out "Fatal Error: CallFunc: Process queue full." This may have something to do with my proc call funcs for the negative acknowledgements if a bunch of machines time out on mint while it's shutting down. I'll have to look at that. 1270. Date: Mon, 4 Jun 90 01:00:22 PDT From: mgbaker (Mary Gray Baker) Subject: more about mint crash I forgot to mention that mint hit a breakpoint trap while in the debugger last night. It printed out no other information than that. I hadn't yet attached to it from rosemary. I'd accidentally hit break and then continue from the console, and things seemed fine at first, but this probably messed something up. 1271. Date: 4 Jun 90 09:16:42 PDT (Monday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: oregano deadlock yet again I thought I indicated how to fix the handle lock/consist lock deadlock. The consistency lock and the handle lock should not be held at the same time. 1272. 
Date: Mon, 4 Jun 90 12:09:15 PDT From: culler (David Culler) Subject: Slipping into the Sprite|Unix crack In attempting to send a file to a printer on a Unix machine, e.g., shallot or guitar, it is not uncommon to see a message of the form: gluttony.Berkeley.EDU: waiting for queue to be enabled on guitar The spooled files just sit in the queue waiting for said enable. Does anybody know what the cause is and how to deal with it? P.S. If you were wondering why the above message came from gluttony rather than cardamom, it is because gluttony has appropriate privilege on Guitar. Given that sprite functions as a single machine in many ways, it would be nice if it could fool the outside world into viewing it as such. This would simplify .rhosts, mail, and so on. 1273. Date: Mon, 04 Jun 90 12:17:35 PDT From: Fred Douglis <douglis> Subject: Re: Slipping into the Sprite|Unix crack in practice, anything dealing with the internet still views sprite as multiple hosts. mail is the only exception, and even there, mail that goes out as "sprite" has SMTP daemons saying "kvetching, why do you call yourself sprite?" and has mailers all over converting "sprite" into "allspice" since they think that's the canonical form. the problem is that IP connections go to the ipServer on the particular host, using an ethernet address for that host. so for things like rlogin (so .rhosts would just trust "sprite") you'd have to have one host doing all the internet access for the whole cluster. this isn't unheard of, and is okay for logins, but it might be a burden when doing rcp/ftp transfers. with respect to lpd, the simplest solution w.r.t. hostnames may be to redirect all files via allspice, and have outside printers trust allspice. that's already done with the spur unix machines -- they spool to ginger and it passes it on from there. as for the "waiting for queue..." message -- that's been a thorn in our side for months now, and i hope it can be fixed...
i presume we can discuss it at today's meeting. 1273. Date: Mon, 4 Jun 90 17:16:29 PDT From: shirriff (Ken Shirriff) Subject: socket bug >From choi@postgres.Berkeley.EDU Mon Jun 4 16:03:40 1990 >i wrote testprograms, ~choi/postgres/test/testparent.c and >~choi/postgres/test/testchild.c, to test whether socket worked or not >between parent and child processes. they ran fine on ultrix, but >don't run on sprite. "testparent" creates a socket, forks off >a child process, which execl's "testchild", then the parent process >does a sento() to the socket. the child process tries to read from >the socket, and if it is successful, it prints the data it has read. >but on sprite "testchild" is blocked on read() and waits forever. >do "make testparent" and "make testchild" and execute the program >by typing "testparent". 1274. Date: Tue, 5 Jun 90 17:49:32 PDT From: shirriff (Ken Shirriff) Subject: Junked mail file My mail file got messed up sometime. Of course, I'm the person I should complain to about this, but I wanted it to be reported. 1275. Date: Tue, 05 Jun 90 17:59:59 PDT From: sequent!fubar@uunet.uu.net Subject: Deadlock in fscache I have an easily recreatable deadlock. The end result looks like the following. I haven't determined exactly how this comes about yet. On some file ("/sprite/src/lib/c/sym.md/libc.a," during a "pmake -J 8 -L 8 -X" on an 8 processor, 40M memory Symmetry): Process A is holding the Fsconsist_Info.lock. It is blocked on the Fs_HandleHeader.unlocked Sync_Condition, waiting to be woken upon. 
The relevant portion of his stack looks like: _Sched_ContextSwitchInt() from 0x6207a _SyncEventWaitInt+62 _SyncEventWaitInt() from 0x6184a _Sync_SlowWait+8a _Sync_SlowWait() from 0x3cd08 _Fsutil_HandleLockHdr+38 _Fsutil_HandleLockHdr() from 0x276a5 _Fsconsist_GetClientAttrs+cd _Fsconsist_GetClientAttrs() from 0x2e6c4 _FslclGetAttrPath+54 _FslclGetAttrPath() from 0x35d60 _Fsprefix_LookupOperation+bc _Fsprefix_LookupOperation() from 0x19c1b _Fs_GetAttributes+8b _Fs_GetAttributes() from 0x1c0b0 _Fs_CheckAccess+a8 _Fs_CheckAccess() from 0x6455 _MachFetchArgs+3a Process B is holding the Fs_HandleHeader LOCK_HANDLE, and is blocked on the Fsconsist_Info.lock. It appears that the relevant portion of his stack looks like: _IdleLoop() from 0x5f29f _Sched_ContextSwitchInt+e3 _Sched_ContextSwitchInt() from 0x6207a _SyncEventWaitInt+62 _SyncEventWaitInt() from 0x616e0 _Sync_SlowLock+90 _Sync_SlowLock() from 0x61606 _Sync_GetLock+1e _Sync_GetLock() from 0x275f7 _Fsconsist_GetClientAttrs+1f _Fsconsist_GetClientAttrs() from 0x2e6c4 _FslclGetAttrPath+54 _FslclGetAttrPath() from 0x35d60 _Fsprefix_LookupOperation+bc _Fsprefix_LookupOperation() from 0x19c1b _Fs_GetAttributes+8b _Fs_GetAttributes() from 0x1c0b0 _Fs_CheckAccess+a8 _Fs_CheckAccess() from 0x6455 _MachFetchArgs+3a Should LOCK_HANDLE really be called with the LOCK_MONITOR held? This appears (at first glance) to be the cause of the deadlock. 1276. Date: Tue, 5 Jun 90 22:21:13 PDT From: shirriff@sprite.Berkeley.EDU (Ken Shirriff) Subject: Re: Deadlock in fscache The problem with the deadlock here is that LOCK_HANDLE and UNLOCK_HANDLE expand to: LOCK_MONITOR, (un)lock handle, UNLOCK_MONITOR. FslclGetAttrPath locks, unlocks, and relocks the handle. So the final sequence is: L(M), L(H), U(M), L(M), U(H), U(M), L(M), L(H), U(M) 1277. Date: Tue, 5 Jun 90 22:35:56 PDT From: shirriff (Ken Shirriff) Subject: Re: Deadlock in fscache (I'll try again) [I didn't mean to send that out before, since it was incomplete.] 
The problem with the deadlock here is that Fsutil_HandleLock and Fsutil_HandleUnlock expand to: LOCK_MONITOR, (UN)LOCK_HANDLE, UNLOCK_MONITOR. But FslclGetAttrPath locks, unlocks, and relocks the handle with Fsutil_Handle(un)Lock. So the lock sequence is (where L(M) locks the monitor, etc.):
    L(M), L(H), U(M), L(M), U(H), U(M), L(M), L(H), U(M)
          a           a                 b     b
One process locks the handle and then tries locking the monitor at (a). The other process locks the monitor and then tries locking the handle at (b). As a result they deadlock. I think the handle should be locked like a monitor lock, instead of being protected by a separate monitor lock. Alternatively, I don't think it's necessary to LOCK_MONITOR before unlocking the handle. If the LOCK_MONITOR before UNLOCK_HANDLE is removed, LOCK_MONITOR won't be called with the handle locked, and deadlock can't occur. 1278. Date: 6 Jun 90 10:28:50 PDT (Wednesday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: Deadlock in fscache Right, we (at least I) understand this deadlock. The fix is to never hold the handle lock while inside the consistency monitor. Currently the locking structure is:
    1) do something that gets you a locked handle (fetch, install, LocalLookup)
    2) enter the consistency monitor
    3) unlock the handle
    4) do consistency stuff
    5) relock the handle
    6) unlock the consistency monitor
This approach only reduces the deadlock window to a small one. You should be able to make a pass through fsConsist.c and fix this. Actually, you'll probably want to fix the routines that call into fsConsist.c to unlock the handle first. I've told the folks at Berkeley about this, but they are pondering the whole locking structure before doing anything (apparently). The general rule to guide your fix is to not hold the handle lock while inside the consistency monitor. 1279.
Date: 6 Jun 90 10:41:02 PDT (Wednesday) From: "Brent_B._Welch.PARC"@Xerox.COM Subject: Re: Deadlock in fscache (I'll try again) Ken says "I don't think it's necessary to LOCK_MONITOR before unlocking the handle". In fact, the proper fix is to never drop into the consistency monitor with a locked handle. In other words, unlock the handle before entering the consistency monitor. You still need the consistency monitor, and it has to be a different lock than the handle lock. Think of the handle lock as a lock for initialization purposes, and note that this also includes pathname lookup (a debatable merge). Once the handle has been set up, then it is ok to unlock the handle (while still keeping a reference to it). Now it is ok to drop into the consistency monitor to generate callbacks, etc. 1280. Date: Wed, 6 Jun 90 14:47:02 PDT From: mgbaker (Mary Gray Baker) Subject: mint crash during shutdown I forgot to send mail about this last night. Mint got another fatal error during shutdown last night. It got to the point in the shutdown sequence where it prints out that it's about to sync its disks. Then it printed out "F" on the next line (for Fatal Error) but didn't make it into the debugger. This only seems to happen when mint keeps timing out on the machine that I guess is running the global migration daemon. 1281. Date: Wed, 6 Jun 90 22:26:33 PDT From: shirriff (Ken Shirriff) Subject: Socket bug Does someone want to look at the socket bug the postgres people are having? I don't understand much about sockets. I've simplified the test program to ~shirriff/testsocket.c, which works on rosemary and transmits data. However on sprite it blocks waiting to receive data.
The basic concept is:
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(fd, &addr, sizeof(addr));
    len = sizeof(addr);    /* getsockname needs the address of a length variable */
    getsockname(fd, &addr, &len);
    sendto(fd, data, sizeof(data), 0, &addr, sizeof(addr));
    read(fd, &buf, 10);
This should open a socket, send data to the socket, and then receive the data, but it blocks in the read. 1282. Date: Thu, 07 Jun 90 11:07:06 PDT From: Fred Douglis <douglis> Subject: there's a whole lotta crashing goin' on between mendel's simulator crashing ds3100s and some recovery-related bug (handles remain, etc.), a lot of machines died yesterday evening:
    piracy      ds3100  down  0+10:32
    mace        sun3    down  0+10:46
    buzz        sun3    down  0+10:54
    joyride     sun4    down  0+12:01
    burble      sun4    down  0+12:25
    catnip      sun3    down  0+12:49
    crackle     sun4    down  0+12:59
    terrorism   sun4    down  0+13:06
    sassafras   sun4    down  0+13:09
    garlic      ds3100  down  0+13:16
    fenugreek   sun3    down  0+13:23
    lsisim      ds3100  down  0+16:24
at least one machine that i was about to debug (terrorism) was actually up, but its migration daemon must not be functioning right. mary also sent mail to me regarding getting timeouts doing migrations during pmake. the message was sent at 11:20, which made me think it was recovery-related, especially with her later mail about some machines not recovering. however, the down times for the machines she mentioned imply that they crashed earlier, and not all at once. 1283. Date: Thu, 07 Jun 90 11:20:16 PDT From: Fred Douglis <douglis> Subject: "handles remain" bug solved the fix for incrementing the recovery reference count when migrating devices was added to fsrmt, but apparently after 1.065 was made. that's why a lot of machines started dying, as they panicked after they'd done so many migrations that the reference count went to 0 despite having 150 or so handles to mint. actually, the problem occurred because mint was once again running the global migration daemon, which usually isn't the case. 1284.
Date: Thu, 7 Jun 90 12:24:29 PDT From: mendel (Mendel Rosenblum) Subject: Re: Socket bug There is a bug in the ipServer such that datagrams sent to the address INADDR_ANY get dropped on the floor rather than sent to the local host as it is done on Unix. You can patch around this bug by using the address of the local system rather than INADDR_ANY. For example:
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(fd, &addr, sizeof(addr));
    len = sizeof(addr);    /* getsockname needs the address of a length variable */
    getsockname(fd, &addr, &len);
    {
        char hostname[256];
        struct hostent *hp;
        gethostname(hostname, 256);
        hp = gethostbyname(hostname);
        bcopy(hp->h_addr_list[0], &addr.sin_addr, sizeof(addr.sin_addr));
    }
    sendto(fd, data, sizeof(data), 0, &addr, sizeof(addr));
    read(fd, &buf, 10);
will work. Also, this is a VERY slow way to communicate on Sprite. The ipServer actually sends the packet! 1285. Date: Fri, 8 Jun 90 01:00:45 -0700 From: choi@postgres.Berkeley.EDU (Ron Choi) Subject: more socket bug in ~choi/postgres/test directory, there are three test programs named "test", "test2", and "testsocket". in order to run these test programs, type "test port_number", where port number is any number between 1025 and 32382. and from another window, type "test2 port_number hostname", where hostname is the name of the host machine. "test" creates a socket and reads a string, "message blah, blah", from it and prints the string to stdout. then it execl's "testsocket". "testsocket" reads a string, "QQQQQQQQQQ", from the socket and prints it to stdout. "test2" writes 2 strings, first of which is read by "test" and the other read by "testsocket". i ran this on babylon, and "testsocket" prints out "received:F", but it should be "received:FQQQQQQQQQQQQQQQQ". i know these test programs are pretty far-fetched, but this is exactly what postgres does with the sockets it creates. 1286. Date: Fri, 8 Jun 90 10:36:00 PDT From: mendel (Mendel Rosenblum) Subject: Re: more socket bug This is a bug in the file system.
The dup2 system call should clear the close-on-exec flag but it doesn't. A simple patch would be to clear the close-on-exec flag after the dup. For example, in routine startup(i) in file test.c change:
    if((proc_id = fork()) == 0){
        dup2(i, Portfd);
        printf("new Portfd: %d\n", Portfd);
        execl("testsocket", "testsocket", 0);
    }
to
    if((proc_id = fork()) == 0){
        dup2(i, Portfd);
    #ifdef sprite
        /*
         * Patch bug in dup2() on Sprite that doesn't clear
         * close-on-exec flag.
         */
        (void) fcntl(Portfd, F_SETFD, 0);
    #endif
        printf("new Portfd: %d\n", Portfd);
        execl("testsocket", "testsocket", 0);
    }
Mendel 1287. Date: Sat, 9 Jun 90 18:30:06 PDT From: shirriff (Ken Shirriff) Subject: lpr kills sage Whenever I try to print something on sage, the machine locks up and doesn't respond to the keyboard or ping. This happened with several different kernels (after rebooting). I just noticed the printer was turned off, and it works with the printer on. So the problem is that if the printer's off, the machine wedges up. 1288. Date: Sat, 09 Jun 90 18:33:05 PDT From: tve (Thorsten von Eicken) Subject: Re: lpr kills sage Yep, same thing if there's a tty connected and that's off. I have a tty on crackle and I better not turn it off. Curiously if I hit L1-a and type "c<return>" things work again until some tty buffer is empty/full. 1289. Date: Sun, 10 Jun 90 10:03:43 PDT From: ouster (John Ousterhout) Subject: Corrupted file on /mic The checksummer found the following file to be corrupted this morning: /mic/wlo/AES.3/16d/saram3Tcontrol.sdl
    allspice-4# fsindex -dev rsd02 -part c saram3Tcontrol.sdl
    saram3Tcontrol.sdl Desc 48602 size 1588 kbytes 2 version 0: 56648 -1 * 2 frag(s) offset 0 saram3Tcontrol.sdl 1 blocks 1 seeks
    allspice-5# ls -l saram3Tcontrol.sdl
    -rw-r--r-- 1 wlo 1588 Mar 22 18:27 saram3Tcontrol.sdl
Once again, it's fragment 1 of the block that has been corrupted. The corruption looks like a log of some CAD program. 1290.
Date: Sun, 10 Jun 90 18:41:07 PDT From: elm (ethan miller) Subject: mysterious mail file deletion This is a major bug. For some unknown reason, my mail spool file has just disappeared. It happened immediately after I was auto-logged out while using the new tcsh Fred installed on the sun4c. I don't know whether this is a shell bug or a mail bug, but either way, it's pretty serious. If possible, I'd like to get my mail file restored from the latest possible te as well. thanks 1291. Date: Sun, 10 Jun 90 18:44:06 PDT From: Fred Douglis <douglis> Subject: Re: strange problem with tcsh [this is in regard to the strange tcsh problem that ethan encountered. i don't know why it would cause ethan's mail to disappear, but i do know what caused the auto-logout.] turns out there's a problem with setting the policy == 4 -- if you foreground something and it tries to migrate, it sets an alarm and then does the wrong action trying to handle the alarm (thinks its autologout mechanism has kicked in). i will attempt to disable auto-logout completely to fix that problem and then install a new tcsh. 1292. Date: Sun, 10 Jun 90 19:25:58 PDT From: douglis (Fred Douglis) Subject: restoring files I wanted to restore ethan's mail file for him without him having to wait for bob to be around. problem is, i see nothing in the restore man page or the howto file that would explain how to restore a file to a different path. since i don't want to risk clobbering ethan's mailbox with old mail superseding new mail, i decided not to touch it. is there a way to do this, and if not, can we make it so there is? if so, can this be added to the appropriate files? from what i can see, the howto file is extremely out of date. 1293. Date: Sun, 10 Jun 1990 20:37:23 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: hijack problems Hijack has been exhibiting behavior where it goes into a tight loop complaining about a tlb fault or something (its hard to read the console since its flying by so fast). 
It does not go into the debugger. When I reset it it fails diagnostics. Typing "t d" (disk ram test) fails. Perhaps a repair person should look at it. I'm not sure if the two problems are related -- let's fix the latter and see if the former goes away. 1294. Date: Mon, 11 Jun 90 09:56:15 PDT From: Fred Douglis <douglis> Subject: gluttony crashes i asked johnw why he was running a program in an infinite loop to keep his machine busy. his response: ------- Forwarded Message Date: Mon, 11 Jun 90 09:22:13 -0700 From: johnw@sprite.Berkeley.EDU (John Wawrzynek) To: douglis@sprite.Berkeley.EDU Subject: Re: busy gluttony I've been running ~johnw/misc/busy.c to see if keeping the processor busy has anything to do with a series of TLB faults at 800bf8f8. In fact, I've had no crashes since I've been running it. Sorry to throw off your statistics. I will gladly turn it off if there is someone who can look at my crashes. Thanks. - -JohnW ------- End of Forwarded Message gluttony is no longer running the vov master, i think. but it might still have been crashing due to other things. (kvetching died this morning around 8 -- haven't yet debugged it but will in a moment. probably someone else migrating onto it, but i'll see.) in any case, the question is whether gluttony is staying up because no one is running anything interesting on it anymore (w/ the load so high no one will migrate onto it) or whether keeping the processor busy really does make a difference. 1295. Date: Mon, 11 Jun 90 10:35:17 PDT From: Fred Douglis <douglis> Subject: ds3100 TLB crash kvetching died in Fs_SelectStub with the indirection through streamPtr->ioHandlePtr->fileID.type].select. the running process was inetd, and it died because the streamPtr corresponding to netTCP had a NIL ioHandlePtr. 1296. Date: Mon, 11 Jun 90 11:32:50 PDT From: Fred Douglis <douglis> Subject: hung wall rpcs wedge system treason has been hung up since sometime last night.
i took an initial stab at debugging it, thinking maybe it was the migration deadlock i sent mail about a few days ago, but it turns out to be because of all the wall's you've done since treason rebooted! there are 8 RPC servers wedged up with hung RPCs to rlogin pdevs. As a result, other stuff is wedged up because they can't get RPC channels. also, a lot of rpc servers are in Recov_HostAlive waiting for someone else to complete recovery, i think. $X# Date: Mon, 11 Jun 90 17:55:09 PDT From: shirriff (Ken Shirriff) Subject: Re: Deadlock in fscache > From: sequent!fubar@uunet.uu.net > Right now I get (twice so far): > 03: Fatal Error: Fsio_LocalFileHandleInit, found handle with no descPtr I've looked through the code, and the only way I can see this is in Fsio_FileCloseInt. It does the following: (void)Fslcl_DeleteFileDesc(handlePtr); if (callback) { Fsutil_HandleRelease(handlePtr, TRUE); } Fsutil_HandleRemove(handlePtr); This deletes the descriptor and sets descPtr = NIL. Then it unlocks and releases the handle. Then it removes the handle. If another process grabs the handle after it's unlocked, but before it's been removed, it will get the handle with descPtr = NIL. I think the fix is to remove the handle and then release it with Fsutil_HandleRelease(handlePtr, FALSE); $X# Date: Tue, 12 Jun 90 19:25:30 PDT From: schauser (Klaus Erik Schauser) Subject: Paprika goes down Paprika goes down very often when running the new X11R4. It just stops accepting any input. Even when pressing L1 a, no input will be accepted, so to reboot it we always need to switch it off. We would very much appreciate if you could take a look at it, because it is not possible to work longer than 20 min at the moment before it happens. 
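A note on the Fsio_FileCloseInt report above: Ken's point is about ordering. If the handle is unlocked before it is taken out of the table, another process can find it in the gap and see descPtr = NIL. The toy model below is only a sketch of that interleaving, not Sprite code. Handle, CloseOrder, toy_lookup and toy_close are made-up names, the "second process" is simulated by a lookup call placed between the two steps, and a real kernel would of course need actual locking.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not the Sprite kernel API. */
typedef struct Handle {
    int   inTable;  /* 1 while a lookup can still find the handle */
    int   locked;   /* 1 while the closing process holds the handle lock */
    void *descPtr;  /* NULL once the file descriptor has been deleted */
} Handle;

typedef enum { RELEASE_THEN_REMOVE, REMOVE_THEN_RELEASE } CloseOrder;

/* What a second process does: it gets the handle only if the handle is
 * still in the table and nobody holds its lock. */
Handle *toy_lookup(Handle *h)
{
    return (h->inTable && !h->locked) ? h : NULL;
}

/* Close a locked, installed handle in the given order.  A lookup by a
 * "second process" is simulated between the two steps.  Returns 1 if that
 * lookup obtained a handle whose descPtr is already NULL, else 0. */
int toy_close(Handle *h, CloseOrder order)
{
    Handle *seen;

    assert(h->inTable && h->locked);
    h->descPtr = NULL;               /* descriptor deleted first */

    if (order == RELEASE_THEN_REMOVE) {
        h->locked = 0;               /* release: handle is now unlocked... */
        seen = toy_lookup(h);        /* ...and still findable in the gap */
        h->inTable = 0;              /* remove comes too late */
    } else {
        h->inTable = 0;              /* remove: no new lookup can find it */
        seen = toy_lookup(h);        /* second process finds nothing */
        h->locked = 0;               /* release afterwards */
    }
    return seen != NULL && seen->descPtr == NULL;
}
```

For a handle that starts out installed, locked and with a non-NULL descPtr, toy_close(&h, RELEASE_THEN_REMOVE) returns 1 (the second process saw the half-dead handle), while toy_close(&h, REMOVE_THEN_RELEASE) returns 0, which corresponds to the remove-then-release ordering Ken suggests.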
$X# Date: Thu, 14 Jun 90 12:16:54 PDT From: culler (David Culler) Subject: ioctl: bad command TIOCNOTTY Shows up when on: enscript -2r -G -Pms [filename] $X# Date: Fri, 15 Jun 90 14:29:59 PDT From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: oregano out of memory oregano crashed, and this time i was able to print a dump of its memory. the relevant counters: size in use 24 44558 32 28623 48 4494 72 3471 88 2921 136 1132 328 1132 total alloc: 4784384, freed 935528. the large objects were not allocated much, only 2 or 3 max. $X# Date: Sat, 16 Jun 90 10:51:27 PDT From: ouster (John Ousterhout) Subject: Mint crash, plus bad reboot instructions Mint was catatonic when I came in this morning. It took two reboots to get it up again. The first reboot hung part-way through recovery after printing a mysterious dangling "F" on the syslog (sounds a bit like the deadlocks during shutdown?). Mary, can you fix the reboot instructions you left on Mint's console? They vaguely mention a kernel "sun3.MB.084 on ginger" but don't give any instructions about how to actually boot this beast. I first had to deduce the machine address (0,961c,43), and then had to login to Ginger and probe around to discover that the kernel is in tmp/sun3.MB.084. Although I was able to figure this out, I doubt that a non-Spriter would have been able to figure it out. Can you redo the note to be much more explicit about EXACTLY what to do to reboot? Also, at this point I think the note probably shouldn't mention sun3.new as a fall-back kernel; given the recovery storm problems, it won't get through recovery, will it? $X# Date: Sat, 16 Jun 90 15:45:19 PDT From: ouster (John Ousterhout) Subject: File corruption While we were at USENIX another file corruption occurred, in file /sprite/lib/fonts/pk/ilcmssi8.746pk. The file inherited a piece of a mail message (actually, the entire message) from John Hartman. 
Unfortunately, the actual corruption seems to have occured on Monday, at a time when Mint was running a kernel without instrumentation, so there was nothing about the file in the trace logs. As usual, the corruption occurred in the second fragment of a block: mint-2# ls -l ilcmssi8.746pk -rw-rw-r-- 1 root 1464 Oct 25 1987 ilcmssi8.746pk mint-3# fsindex -dev rxy0 -part g ilcmssi8.746pk ilcmssi8.746pk Desc 58529 size 1464 kbytes 2 version 0: 44268 -1 * 2 frag(s) offset 0 ilcmssi8.746pk 1 blocks 1 seeks $X# Date: Sat, 16 Jun 90 16:05:43 PDT From: ouster (John Ousterhout) Subject: Another file corruption on /mic /mic/tve/Mail/ARRIVAL/#32 was corrupted while we were gone. The end of the file inherited the following data: Return-Path: <ucbvax!scam.Berkeley.EDU!latta> Received: by cnmat (NeXT-1.0 (From Sendmail 5.52)/NeXT-1.0) id AA03048; Tue, 12 Jun 90 15:31:42 GMT-0800 id id6920 ms bc1b-81 folio 157v stanza 2 file bc1t6 line 210 %%%%% Looks like some sort of log output (music-related?). The usual fsindex information is below. Bob, can you restore this file from backup? allspice-3# cd /mic/tve/Mail/ARRIVAL allspice-4# fsindex -dev rsd02 -part c #32 #32 Desc 104636 size 1453 kbytes 2 version 0: 24240 -1 * 2 frag(s) offset 0 #32 1 blocks 1 seeks allspice-5# ls -l #32 -rw-r--r-- 1 tve 1453 Jun 12 19:52 #32 $X# Date: Sun, 17 Jun 90 21:53:34 PDT From: mgbaker (Mary Gray Baker) Subject: mint crash I think we forgot to mail this out. Mint crashed this afternoon with disk errors. When we tried to reboot it, we got a Fatal Error that turns out not to be a deadlock. The call-back queue is getting full because of a probable bug in my NegAck stuff and because processes are piling up waiting for an interrupt to tell them their network packets have been sent. $X# Date: Mon, 18 Jun 90 13:33:58 PDT From: tve (Thorsten von Eicken) Subject: assault crash Assault crashed this morning with a TLB load address error exception at PC0x800c06c0. I rebooted it. 
The instructions are not clear: am I supposed to reboot with tftp()new (as printed on the sheet) or with rz()ds3100 (as hand-written in big letters on the same sheet). Just for grins I used neither and rebooted with rz()new... (hehe) That booted 1.065, is that old? Also, I wondered why assault checks all its disks sequentially? It takes forever!@#$%^&*( $X# Date: Tue, 19 Jun 90 09:38:24 PDT From: tve (Thorsten von Eicken) Subject: allspice has very high load Allspice's load average seems pretty high: always around 1. I did a ps to see who was using the cpu time and about 50% were chewed-up by ntalkd. However the total cpu minutes used by ntalkd were not very high. A few minutes later, the original ntalkd had died and a new one was using the same 50% cpu. Mhhhhhhh, bizarre. Well, let's see if it calms down after today's shutdown (machine room work). Just thought I'd signal this... $X# Date: Wed, 20 Jun 90 11:44:19 PDT From: stolcke (Andreas Stolcke) Subject: popen() not declared in stdio.h The subject line says it all. $X# Date: Wed, 20 Jun 90 18:19:57 PDT From: Fred Douglis <douglis> Subject: Re: allspice has very high load i've seen this before; thought i'd reported it but perhaps not. it has required killing off and restarting inetd. we should definitely try to track down inetd's flakiness... $X# Date: Thu, 21 Jun 90 08:50:01 PDT From: ouster (John Ousterhout) Subject: Corrupted file This time it was a file on /user1: /user1/sah/mail/ernie/sumjob/IBMiannuci. allspice-2# ls -l IBMiannuci -rw-r--r-- 1 sah 30 Jun 14 16:18 IBMiannuci allspice-3# fsindex -dev rsd01 -part c IBMiannuci IBMiannuci Desc 64926 size 30 kbytes 1 version 0: 225081 -1 * 1 frag(s) offset 1 IBMiannuci 1 blocks 1 seeks $X# Date: Thu, 21 Jun 90 09:18:44 PDT From: ouster (John Ousterhout) Subject: Corrupted file /mic/johnw/dpp/ncube/msg/justsend.c got the shaft this time: weird-looking numbers in the second (last) fragment of the file. If this keeps up I may have to start tracing /mic allocations.
allspice-8# ls -l justsend.c -rw-rw-r-- 1 johnw 1514 Jun 15 20:53 justsend.c allspice-9# fsindex -dev rsd02 -part c justsend.c justsend.c Desc 136018 size 1514 kbytes 2 version 0: 40 -1 * 2 frag(s) offset 0 justsend.c 1 blocks 1 seeks allspice-10# tail justsend.c char *buf; /* Message data */ int i; /* Index */ int repeat=100; /* Number of times to repeat message */ double single; /* Time for message in microseconds */ int debug = 0; /* Flag to print debug messages */ char *getenv(); /* Get runtime environment */ whoami(&me, &proc, &host, &alt.sex.bondage 645814928 21452 $X# Date: Thu, 21 Jun 1990 18:32:27 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: rdate broken? I've noticed that a couple of machines were about 5 minutes out of sync with mint. Is the rdate in the crontab not working? $X# Date: Thu, 21 Jun 90 22:26:53 PDT From: Fred Douglis <douglis> Subject: problem installing symlinks in kernel area when doing "make install" in a directory such as libc, with symlinks to ../../lib/net/*.c, the symlink is copied verbatim, so it no longer points to a valid file. then the snapshot script tries to follow the link and can't. $X# Date: Fri, 22 Jun 90 12:19:09 PDT From: shirriff (Ken Shirriff) Subject: pdev.c problems The Pdev routines in /sprite/src/lib/c/etc/pdev.c have various problems. In particular, some of the default handlers take the wrong number of arguments, or expect the wrong type of argument (see lint for details). Things probably work since most of these arguments aren't used, but someone who knows the pdev stuff should make sure everything is right. $X# Date: Fri, 22 Jun 90 14:49:32 PDT From: root (The Sprite God) Subject: allspice inetd restarted someone complained that finger @sprite was refused. restarting inetd fixed the problem. this is one flaky program. $X# Date: Fri, 22 Jun 1990 15:42:36 PDT From: jhh@sprite.Berkeley.EDU (John H. 
Hartman) Subject: kgdb.sun3 broken I frequently cannot print out the contents of a variable when debugging a sun3. I get something similar to the following: (gdb) p virtPage No symbol "virtPage" in current context. $X# Date: Fri, 22 Jun 90 15:47:20 PDT From: Fred Douglis <douglis> Subject: kgdb.sun4 broken too last night, i spent some unconscionable amount of time trying to debug my migration fix, because the line at which i was told the error occurred was actually one or two statements before the line that actually caused the problem. furthermore, the register variable i was looking at claimed to be 0, which would have explained the bad pointer i hit. so, i spent my time trying to figure out how the pointer suddenly became 0 rather than seeing that it was a completely different pointer hitting the error. that's ALMOST as bad as trying to use kdbx on a ds3100. but not quite.... $X# Date: Fri, 22 Jun 90 15:50:51 PDT From: tve (Thorsten von Eicken) Subject: Re: kgdb.sun4 broken too Did you compile with the -O flag? If you did (default in the makefile) you should know that debugging is approximate. I.e. things might not get executed in the same order as in the source. I just would like to point out that all these problems are in the normal gdb too. Plus problems with continuing after breakpoints (looks like jhh's fix a few weeks ago fixed dbx and broke gdb?). $X# Date: Fri, 22 Jun 90 16:17:18 PDT From: tve (Thorsten von Eicken) Subject: can't mail to davidson.cs.sandia.gov - host unknown Is the mailer brain-damaged, or am I expecting too much? ------- Forwarded Message Return-Path: tve Received: by sprite.Berkeley.EDU (5.59/1.29) id AA997247; Fri, 22 Jun 90 16:15:34 PDT Date: Fri, 22 Jun 90 16:15:34 PDT From: MAILER-DAEMON (Mail Delivery Subsystem) Subject: Returned mail: Host unknown Message-Id: <9006222315.AA997247@sprite.Berkeley.EDU> To: tve ----- Transcript of session follows ----- 550 gsdavid@davidson.cs.sandia.gov... 
Host unknown ----- Unsent message follows ----- Received: by sprite.Berkeley.EDU (5.59/1.29) id AA931709; Fri, 22 Jun 90 16:15:34 PDT Date: Fri, 22 Jun 90 16:15:34 PDT From: tve (Thorsten von Eicken) Message-Id: <9006222315.AA931709@sprite.Berkeley.EDU> To: gsdavid@davidson.cs.sandia.gov Subject: test ------- End of Forwarded Message $X# Date: Fri, 22 Jun 90 17:16:58 PDT From: elm (ethan miller) Subject: memory leak? After quite a while (7+ days), my SparcStation gets extremely slow when it's running X. Exiting X and restarting it doesn't do any good. Only restarting the machine seems to help. I checked, and it wasn't being caused by migrated processes or anything else like that. This isn't the first time I've noticed it, either. I guess it's not serious, but someone should know about it. 1297. Date: Mon, 11 Jun 90 17:55:09 PDT From: shirriff (Ken Shirriff) Subject: Re: Deadlock in fscache > From: sequent!fubar@uunet.uu.net > Right now I get (twice so far): > 03: Fatal Error: Fsio_LocalFileHandleInit, found handle with no descPtr I've looked through the code, and the only way I can see this is in Fsio_FileCloseInt. It does the following: (void)Fslcl_DeleteFileDesc(handlePtr); if (callback) { Fsutil_HandleRelease(handlePtr, TRUE); } Fsutil_HandleRemove(handlePtr); This deletes the descriptor and sets descPtr = NIL. Then it unlocks and releases the handle. Then it removes the handle. If another process grabs the handle after it's unlocked, but before it's been removed, it will get the handle with descPtr = NIL. I think the fix is to remove the handle and then release it with Fsutil_HandleRelease(handlePtr, FALSE); 1298. Date: Tue, 12 Jun 90 19:25:30 PDT From: schauser (Klaus Erik Schauser) Subject: Paprika goes down Paprika goes down very often when running the new X11R4. It just stops accepting any input. Even when pressing L1 a, no input will be accepted, so to reboot it we always need to switch it off. 
We would very much appreciate if you could take a look at it, because it is not possible to work longer than 20 min at the moment before it happens. 1299. Date: Thu, 14 Jun 90 12:16:54 PDT From: culler (David Culler) Subject: ioctl: bad command TIOCNOTTY Shows up when on: enscript -2r -G -Pms [filename] 1300. Date: Fri, 15 Jun 90 14:29:59 PDT From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: oregano out of memory oregano crashed, and this time i was able to print a dump of its memory. the relevant counters:
size    in use
  24     44558
  32     28623
  48      4494
  72      3471
  88      2921
 136      1132
 328      1132
total alloc: 4784384, freed 935528. the large objects were not allocated much, only 2 or 3 max. 1301. Date: Sat, 16 Jun 90 10:51:27 PDT From: ouster (John Ousterhout) Subject: Mint crash, plus bad reboot instructions Mint was catatonic when I came in this morning. It took two reboots to get it up again. The first reboot hung part-way through recovery after printing a mysterious dangling "F" on the syslog (sounds a bit like the deadlocks during shutdown?). Mary, can you fix the reboot instructions you left on Mint's console? They vaguely mention a kernel "sun3.MB.084 on ginger" but don't give any instructions about how to actually boot this beast. I first had to deduce the machine address (0,961c,43), and then had to login to Ginger and probe around to discover that the kernel is in tmp/sun3.MB.084. Although I was able to figure this out, I doubt that a non-Spriter would have been able to figure it out. Can you redo the note to be much more explicit about EXACTLY what to do to reboot? Also, at this point I think the note probably shouldn't mention sun3.new as a fall-back kernel; given the recovery storm problems, it won't get through recovery, will it? 1302. Date: Sat, 16 Jun 90 15:45:19 PDT From: ouster (John Ousterhout) Subject: File corruption While we were at USENIX another file corruption occurred, in file /sprite/lib/fonts/pk/ilcmssi8.746pk.
The file inherited a piece of a mail message (actually, the entire message) from John Hartman. Unfortunately, the actual corruption seems to have occurred on Monday, at a time when Mint was running a kernel without instrumentation, so there was nothing about the file in the trace logs. As usual, the corruption occurred in the second fragment of a block: mint-2# ls -l ilcmssi8.746pk -rw-rw-r-- 1 root 1464 Oct 25 1987 ilcmssi8.746pk mint-3# fsindex -dev rxy0 -part g ilcmssi8.746pk ilcmssi8.746pk Desc 58529 size 1464 kbytes 2 version 0: 44268 -1 * 2 frag(s) offset 0 ilcmssi8.746pk 1 blocks 1 seeks 1303. Date: Sat, 16 Jun 90 16:05:43 PDT From: ouster (John Ousterhout) Subject: Another file corruption on /mic /mic/tve/Mail/ARRIVAL/#32 was corrupted while we were gone. The end of the file inherited the following data: Return-Path: <ucbvax!scam.Berkeley.EDU!latta> Received: by cnmat (NeXT-1.0 (From Sendmail 5.52)/NeXT-1.0) id AA03048; Tue, 12 Jun 90 15:31:42 GMT-0800 id id6920 ms bc1b-81 folio 157v stanza 2 file bc1t6 line 210 %%%%% Looks like some sort of log output (music-related?). The usual fsindex information is below. Bob, can you restore this file from backup? allspice-3# cd /mic/tve/Mail/ARRIVAL allspice-4# fsindex -dev rsd02 -part c #32 #32 Desc 104636 size 1453 kbytes 2 version 0: 24240 -1 * 2 frag(s) offset 0 #32 1 blocks 1 seeks allspice-5# ls -l #32 -rw-r--r-- 1 tve 1453 Jun 12 19:52 #32 1304. Date: Sun, 17 Jun 90 21:53:34 PDT From: mgbaker (Mary Gray Baker) Subject: mint crash I think we forgot to mail this out. Mint crashed this afternoon with disk errors. When we tried to reboot it, we got a Fatal Error that turns out not to be a deadlock. The call-back queue is getting full because of a probable bug in my NegAck stuff and because processes are piling up waiting for an interrupt to tell them their network packets have been sent. 1305.
Date: Mon, 18 Jun 90 13:33:58 PDT From: tve (Thorsten von Eicken) Subject: assault crash Assault crashed this morning with a TLB load address error exception at PC0x800c06c0. I rebooted it. The instructions are not clear: am I supposed to reboot with tftp()new (as printed on the sheet) or with rz()ds3100 (as hand-written in big letters on the same sheet). Just for grins I used neither and rebooted with rz()new... (hehe) That booted 1.065, is that old? Also, I wondered why assault checks all its disks sequentially? It takes forever!@#%%^&*( 1306. Date: Tue, 19 Jun 90 09:38:24 PDT From: tve (Thorsten von Eicken) Subject: allspice has very high load Allspice's load average seems pretty high: always around 1. I did a ps to see who was using the cpu time and about 50% were chewed-up by ntalkd. However the total cpu minutes used by ntalkd were not very high. A few minutes later, the original ntalkd had died and a new one was using the same 50% cpu. Mhhhhhhh, bizarre. Well, let's see if it calms down after today's shutdown (machine room work). Just thought I'd signal this... 1307. Date: Wed, 20 Jun 90 11:44:19 PDT From: stolcke (Andreas Stolcke) Subject: popen() not declared in stdio.h The subject line says it all. 1308. Date: Wed, 20 Jun 90 18:19:57 PDT From: Fred Douglis <douglis> Subject: Re: allspice has very high load i've seen this before; thought i'd reported it but perhaps not. it has required killing off and restarting inetd. we should definitely try to track down inetd's flakiness... 1309. Date: Thu, 21 Jun 90 08:50:01 PDT From: ouster (John Ousterhout) Subject: Corrupted file This time it was a file on /user1: /user1/sah/mail/ernie/sumjob/IBMiannuci. allspice-2# ls -l IBMiannuci -rw-r--r-- 1 sah 30 Jun 14 16:18 IBMiannuci allspice-3# fsindex -dev rsd01 -part c IBMiannuci IBMiannuci Desc 64926 size 30 kbytes 1 version 0: 225081 -1 * 1 frag(s) offset 1 IBMiannuci 1 blocks 1 seeks 1310.
Date: Thu, 21 Jun 90 09:18:44 PDT From: ouster (John Ousterhout) Subject: Corrupted file /mic/johnw/dpp/ncube/msg/justsend.c got the shaft this time: weird-looking numbers in the second (last) fragment of the file. If this keeps up I may have to start tracing /mic allocations. allspice-8# ls -l justsend.c -rw-rw-r-- 1 johnw 1514 Jun 15 20:53 justsend.c allspice-9# fsindex -dev rsd02 -part c justsend.c justsend.c Desc 136018 size 1514 kbytes 2 version 0: 40 -1 * 2 frag(s) offset 0 justsend.c 1 blocks 1 seeks allspice-10# tail justsend.c char *buf; /* Message data */ int i; /* Index */ int repeat=100; /* Number of times to repeat message */ double single; /* Time for message in microseconds */ int debug = 0; /* Flag to print debug messages */ char *getenv(); /* Get runtime environment */ whoami(&me, &proc, &host, &alt.sex.bondage 645814928 21452 1311. Date: Thu, 21 Jun 1990 18:32:27 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: rdate broken? I've noticed that a couple of machines were about 5 minutes out of sync with mint. Is the rdate in the crontab not working? 1312. Date: Thu, 21 Jun 90 22:26:53 PDT From: Fred Douglis <douglis> Subject: problem installing symlinks in kernel area when doing "make install" in a directory such as libc, with symlinks to ../../lib/net/*.c, the symlink is copied verbatim, so it no longer points to a valid file. then the snapshot script tries to follow the link and can't. 1313. Date: Fri, 22 Jun 90 12:19:09 PDT From: shirriff (Ken Shirriff) Subject: pdev.c problems The Pdev routines in /sprite/src/lib/c/etc/pdev.c have various problems. In particular, some of the default handlers take the wrong number of arguments, or expect the wrong type of argument (see lint for details). Things probably work since most of these arguments aren't used, but someone who knows the pdev stuff should make sure everything is right. 1314. 
Date: Fri, 22 Jun 90 14:49:32 PDT From: root (The Sprite God) Subject: allspice inetd restarted someone complained that finger @sprite was refused. restarting inetd fixed the problem. this is one flaky program. 1315. Date: Fri, 22 Jun 1990 15:42:36 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: kgdb.sun3 broken I frequently cannot print out the contents of a variable when debugging a sun3. I get something similar to the following: (gdb) p virtPage No symbol "virtPage" in current context. 1316. Date: Fri, 22 Jun 90 15:47:20 PDT From: Fred Douglis <douglis> Subject: kgdb.sun4 broken too last night, i spent some unconscionable amount of time trying to debug my migration fix, because the line at which i was told the error occurred was actually one or two statements before the line that actually caused the problem. furthermore, the register variable i was looking at claimed to be 0, which would have explained the bad pointer i hit. so, i spent my time trying to figure out how the pointer suddenly became 0 rather than seeing that it was a completely different pointer hitting the error. that's ALMOST as bad as trying to use kdbx on a ds3100. but not quite.... 1317. Date: Fri, 22 Jun 90 15:50:51 PDT From: tve (Thorsten von Eicken) Subject: Re: kgdb.sun4 broken too Did you compile with the -O flag? If you did (default in the makefile) you should know that debugging is approximate. I.e. things might not get executed in the same order as in the source. I just would like to point out that all these problems are in the normal gdb too. Plus problems with continuing after breakpoints (looks like jhh's fix a few weeks ago fixed dbx and broke gdb?). 1318. Date: Fri, 22 Jun 90 16:17:18 PDT From: tve (Thorsten von Eicken) Subject: can't mail to davidson.cs.sandia.gov - host unknown Is the mailer brain-damaged, or am I expecting too much? 
------- Forwarded Message Return-Path: tve Received: by sprite.Berkeley.EDU (5.59/1.29) id AA997247; Fri, 22 Jun 90 16:15:34 PDT Date: Fri, 22 Jun 90 16:15:34 PDT From: MAILER-DAEMON (Mail Delivery Subsystem) Subject: Returned mail: Host unknown Message-Id: <9006222315.AA997247@sprite.Berkeley.EDU> To: tve ----- Transcript of session follows ----- 550 gsdavid@davidson.cs.sandia.gov... Host unknown ----- Unsent message follows ----- Received: by sprite.Berkeley.EDU (5.59/1.29) id AA931709; Fri, 22 Jun 90 16:15:34 PDT Date: Fri, 22 Jun 90 16:15:34 PDT From: tve (Thorsten von Eicken) Message-Id: <9006222315.AA931709@sprite.Berkeley.EDU> To: gsdavid@davidson.cs.sandia.gov Subject: test ------- End of Forwarded Message 1319. Date: Fri, 22 Jun 90 17:16:58 PDT From: elm (ethan miller) Subject: memory leak? After quite a while (7+ days), my SparcStation gets extremely slow when it's running X. Exiting X and restarting it doesn't do any good. Only restarting the machine seems to help. I checked, and it wasn't being caused by migrated processes or anything else like that. This isn't the first time I've noticed it, either. I guess it's not serious, but someone should know about it. 1320. Date: Mon, 25 Jun 90 16:04:59 PDT From: tve (Thorsten von Eicken) Subject: clock out of synch? It seems the noon mint problems (with machines being halted) got many clocks out of sync. Is there a way to get the rdates to run soon? 1321. Date: Mon, 25 Jun 90 16:13:42 PDT From: Fred Douglis <douglis> Subject: Re: clock out of synch? if you know what time most of the machines think it is, you can edit /sprite/lib/cron/crontab to add an entry for a short time from "now" that will cause them to invoke rdate prematurely once they re-stat the file. i think they stat crontab once per minute. 1322. Date: Mon, 25 Jun 90 16:45:49 PDT From: Fred Douglis <douglis> Subject: sun4c reboot fails about 50% of the time, it seems, "shutdown -R ..." hangs on the sun4c. 
it starts downloading and then just sits there. it's necessary to l1-a, or often, power-cycle the machine and then try again. 1323. Date: Mon, 25 Jun 90 22:16:59 PDT From: tve (Thorsten von Eicken) Subject: clock syncronization script broken? crackle-5# whoami root crackle-6# cat /sprite/admin/Rdate #!/sprite/cmds/csh -f if (`hostname` =~ mint*) exit #set id=`hostname -i` #@ id *= 5 #sleep %id rdate mint crackle-7# /sprite/admin/Rdate rdate: connect: connection refused 1324. Date: Tue, 26 Jun 90 00:22:40 PDT From: Fred Douglis <douglis> Subject: Re: clock syncronization script broken? alas, this is the same thing that has been reported repeatedly over the past few weeks. inetd gets fried. to fix the problem, one must kill mint's inetd and restart it. this was mentioned in today's bug report session but skipped over without a discussion of a body to investigate it. i'll try to take a peek. 1325. Date: Tue, 26 Jun 90 12:16:47 PDT From: mgbaker (Mary Gray Baker) Subject: rpc problem? I just got all this in my console. Did anybody else get something like this recently? RpcResend: RPC 23, client 14, RPC seq # 1491a7, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491a7, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491a7, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491a7, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491a7, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491a7, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491a7, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491b0, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491b0, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491b0, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491b0, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491b0, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491b0, forgot reply? RpcResend: RPC 23, client 14, RPC seq # 1491b0, forgot reply? 
get attr of "Makefile" waiting for recovery 6/26/90 12:13:52 allspice (14) RmtFile "/sprite/src/kernel/mgbaker" <6,23748> : stale handle 6/26/90 12:13:52 allspice (14) - recovering handles 6/26/90 12:13:54 allspice (14) Recovery complete 242 handles reopened 1326. Date: Tue, 26 Jun 90 13:52:20 PDT From: shirriff (Ken Shirriff) Subject: Mint crash Mint crashed today with: Fsconsist_IOClientClose: client 44 ref 0 write 0 exec -1. This is the same crash as yesterday. Today the file was "spritehosts"; yesterday it was "crontab". Mint was running the 1.066 kernel. Client 44 is mustard. Both times, before the crash, mustard went into an infinite "entering debugger" loop, so Pete rebooted it. Mint crashed shortly after mustard finished booting. My guess is that the use count is getting mangled in recovery with mustard (maybe in Fsconsist_ReopenClient) but this doesn't cause a panic until the file is closed. 1327. Date: Tue, 26 Jun 90 16:43:43 PDT From: mendel (Mendel Rosenblum) Subject: Bug in Mach_ContextSwitch on sparc machines This is a bug in Mach_ContextSwitch for the sparc machines that can cause random trashing of user processes' stacks. The algorithm used by Mach_ContextSwitch is:
    Switch VM context to new process.
    Save global and floating point regs.
    for (i = 0; i < NUM_REG_WINDOWS; i++) save; /* Spill register windows to stack. */
    for (i = 0; i < NUM_REG_WINDOWS; i++) restore; /* Return to the correct window. */
    Restore global and floating point regs of new process.
    Restore current window.
    return to caller.
The problem is that "save" instructions cause overflow faults that can spill windows to the user's stack. Changing the VM context before the spills means that the registers are spilled to the wrong process's stack. Unless I'm totally misunderstanding something here, I think the switching of VM context should be done after saving the old process's state and before loading the new process's state.
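The ordering problem described above can be sketched as a small user-level C simulation. This is not Sprite kernel code: the names `Proc`, `activeContext`, `SpillWindows`, `ContextSwitchBuggy`, `ContextSwitchFixed` and `NewStackSurvives`, and the window count of 7, are all hypothetical and chosen only for illustration. A "spill" here simply copies register values into the stack of whichever process's VM context is active at that moment.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REG_WINDOWS 7            /* hypothetical window count */

/* Toy process: a "user stack" area that window spills land in. */
typedef struct Proc {
    int stack[NUM_REG_WINDOWS];
} Proc;

/* Stand-in for the MMU: spills go to this process's address space. */
static Proc *activeContext = NULL;

/* Models the overflow traps taken by the loop of "save" instructions. */
void SpillWindows(const int regs[NUM_REG_WINDOWS])
{
    int i;

    assert(activeContext != NULL);
    for (i = 0; i < NUM_REG_WINDOWS; i++) {
        activeContext->stack[i] = regs[i];
    }
}

/* Order described in the report: switch VM context first, then spill. */
void ContextSwitchBuggy(Proc *newProc, const int oldRegs[NUM_REG_WINDOWS])
{
    activeContext = newProc;
    SpillWindows(oldRegs);           /* lands in newProc's stack */
}

/* Proposed order: spill the old state first, then switch VM context. */
void ContextSwitchFixed(Proc *newProc, const int oldRegs[NUM_REG_WINDOWS])
{
    SpillWindows(oldRegs);           /* lands in the old process's stack */
    activeContext = newProc;
}

/*
 * Zero both stacks, run one switch from oldProc to newProc, and report
 * whether newProc's stack was left untouched (1) or overwritten (0).
 */
int NewStackSurvives(void (*switchFn)(Proc *, const int *))
{
    Proc oldProc, newProc;
    int regs[NUM_REG_WINDOWS];
    int i;

    for (i = 0; i < NUM_REG_WINDOWS; i++) {
        oldProc.stack[i] = 0;
        newProc.stack[i] = 0;
        regs[i] = 100 + i;           /* live register contents of oldProc */
    }
    activeContext = &oldProc;        /* oldProc is currently running */
    switchFn(&newProc, regs);

    for (i = 0; i < NUM_REG_WINDOWS; i++) {
        if (newProc.stack[i] != 0) {
            return 0;                /* old registers were spilled here */
        }
    }
    return 1;
}
```

Calling `NewStackSurvives(ContextSwitchBuggy)` and `NewStackSurvives(ContextSwitchFixed)` from a small `main` shows the difference between the two orderings. The simulation always spills every window, whereas on real hardware only the windows that are actually in use overflow.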
Since the sun4 ports appear to run as well as any Sprite port, I think it is correct to conclude that there are almost always at least 6 levels of call nesting between a user's trap into the kernel and Mach_ContextSwitch being called. 1328. Date: Wed, 27 Jun 90 12:04:34 PDT From: shirriff (Ken Shirriff) Subject: Migration bug subversion crashed with a TLB LD miss, running 1.066. The problem is in GetProcEncapSize, which does Byte_AlignAddr(strlen(procPtr->argString) + 1); procPtr->argString was NULL, so strlen died. 1329. Date: Wed, 27 Jun 90 12:22:27 PDT From: Fred Douglis <douglis> Subject: Re: Migration bug >>>>> On Wed, 27 Jun 90 12:04:34 PDT, shirriff@sprite.Berkeley.EDU (Ken Shirriff) said: Ken> subversion crashed with a TLB LD miss, running 1.066. Ken> The problem is in GetProcEncapSize, which does Ken> Byte_AlignAddr(strlen(procPtr->argString) + 1); Ken> procPtr->argString was NULL, so strlen died. i believe this is due to jay's change to procExec i installed a few days ago. it saves away the argString so it can set it back on error and avoid freeing it twice. however, there's a pathological case in which it can jump to the error handling code before it's set argStringSave. later on the proc can wind up with a NIL argString. 1330. Date: Wed, 27 Jun 1990 13:36:53 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: chmod on remote links broken If you try to chmod a remote link you get the following message: chmod: can't change /b, Too many levels of symbolic links 1331. Date: Thu, 28 Jun 90 17:22:35 PDT From: Fred Douglis <douglis> Subject: those debuggable processes i just spent quite some time trying to figure out why the migrating tcsh was complaining along the lines of "/sprite/cmds.%MACHINE/la: invalid argument" when i happened to migrate to forgery. It didn't occur to me to do a ps on that machine, and instead i stepped through the debugger to see where exec was bailing. 
turns out it was running out of segments and/or processes, due mostly to the presence of dozens of raidSim4 processes in the DEBUG state. i'm going to kill off the debug processes so things can get moving again. 1332. Date: Fri, 29 Jun 90 16:43:10 PDT From: shirriff (Ken Shirriff) Subject: Allspice problems Allspice has hung up a couple times today. Apparently the network interface is flaky; Allspice's console had a bunch of RPC to foo is hung messages and it worked after I did L1-N. Also, the ribbon in mint's console is rapidly heading towards oblivion. 1333. Date: Fri, 29 Jun 90 16:51:05 PDT From: Fred Douglis <douglis> Subject: Re: problem with msgs? This was a problem with oregano running a new kernel that didn't byte-swap seek operations and would return an error. (it used to just return SUCCESS, i guess.) larceny is now exporting /sprite2 with a fixed kernel and things work again. 1334. Date: Sat, 30 Jun 90 13:41:45 PDT From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: allspice crash allspice's swap disk filled up. lots and lots of msgs on its console about WRITE FAILED. then, a panic with: Fsdm_FileDescTrunc abandoning (indirect) block 134672 in <1,42334> "154" savedLastByte 16383 Fscache_DeleteFile "154" <1,42334>: 1 cache blocks left (gdb) where #0 panic (__builtin_va_alist=-167521065) (sysPrintf.c line 209) #1 0xf603d760 in Fscache_DeleteFile (...) (...) #2 0xf6044ba4 in Fsio_FileTrunc (...) (...) #3 0xf604b4b8 in Fslcl_DeleteFileDesc (...) (...) #4 0xf60436f4 in Fsio_FileCloseInt (...) (...) #5 0xf604b2d4 in DeleteFileName (...) (...) #6 0xf6049c18 in FslclLookup (...) (...) #7 0xf6048e60 in FslclRemove (...) (...) #8 0xf6058e88 in Fsrmt_RpcRemove (...) (...) #9 0xf608fca8 in Rpc_Server (...) (...) #10 0xf60940e0 in Sched_StartKernProc (...) (...) #11 0xf6094060 in Sched_StartKernProc (...) (...) ERROR: invalid read address 0x2e4265a8 (gdb) up Reading in symbols for fsCacheOps.c...done. 
#1 0xf603d760 in Fscache_DeleteFile (cacheInfoPtr=(struct Fscache_FileInfo *) 0xf6f2b248) (fsCacheOps.c line 1380)
1380 cacheInfoPtr->flags);
(gdb) l
1375 (cacheInfoPtr->flags & (FSCACHE_FILE_ON_DIRTY_LIST|
1376 FSCACHE_FILE_BEING_WRITTEN))) {
1377 panic("Fscache_DeleteFile failed \"%s\" blocks %d flags %x\n",
1378 Fsutil_HandleName(cacheInfoPtr->hdrPtr),
1379 cacheInfoPtr->blocksInCache,
1380 cacheInfoPtr->flags);
1381 }
1382 UNLOCK_MONITOR;
1383 }
...
i poked around, thought this was due to its inability to free space in its cache, and continued. it survived for a moment but when i tried deleting a file from lost+found to make room on the disk, it died with a bad address trying to dereference with a bogus block number. the backtrace this time was:
#3 0xf60321f8 in FsdmBlockFree (domainPtr=(struct Fsdm_Domain *) 0xf65b71b8, blockNum=420909120) (fsAlloc.c line 1317)
#4 0xf60312ac in Fsdm_FileDescTrunc (handlePtr=(struct Fsio_FileIOHandle *) 0xf6b026a8, size=0) (fsAlloc.c line 646)
#5 0xf6044b8c in Fsio_FileTrunc (...) (...)
#6 0xf604b4b8 in Fslcl_DeleteFileDesc (...) (...)
#7 0xf60436f4 in Fsio_FileCloseInt (...) (...)
#8 0xf604b2d4 in DeleteFileName (...) (...)
#9 0xf6049c18 in FslclLookup (...) (...)
#10 0xf6048e60 in FslclRemove (...) (...)
#11 0xf6058e88 in Fsrmt_RpcRemove (...) (...)
looks like Fsdm_FileDescTrunc has to be more defensive about the block numbers it comes up with. finally, a third bug: i followed the instructions on allspice's console to boot "sd()vmsprite" but it claimed "couldn't attach disk". same for vmunix. i had to boot from mint. 1335. Date: Sat, 30 Jun 90 14:17:34 PDT From: Fred Douglis <douglis> Subject: ds3100 randomness when i went into 608-4 to debug allspice i saw that violence was in the debugger. it died when the ptr passed to RpcDaemonWait became 0. since it's called with the address of a structure that looks perfectly fine, the memory must have been trashed.
and, piquante has died 3 times in 3 days with "coprocessor unusable" exceptions. i debugged it this time and it's sitting at a perfectly reasonable instruction having nothing to do with the coprocessor. (or does the coprocessor refer to the TLB?) it was apparently continuable. 1336. Date: Sat, 30 Jun 90 17:33:08 PDT From: Fred Douglis <douglis> Subject: allspice network interface allspice was acting incredibly sluggish again. i was able to contact it enough to talk to its sendmail daemon for a moment, but then even that hung. restarting the ipServer didn't help, but hitting l1-n did. 1337. Date: Sun, 01 Jul 90 14:53:20 PDT From: tve (Thorsten von Eicken) Subject: qsort definition In the man page: SYNOPSIS qsort(base, nel, width, compar) char *base; int (*compar)(); DESCRIPTION ... but: [crackle sun4.md] egrep qsort /sprite/lib/include/*.h /sprite/lib/include/stdlib.h:extern void qsort(); so, does it return something or not? 1338. Date: Sun, 1 Jul 1990 23:05:47 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: new sun3 bcopy I've installed a new sun3 bcopy routine that doesn't do half-word aligned loads if the dest is greater than the source and the length of the transfer ends in either 0x2 or 0x3. The old version would bcopy at half the speed per byte in this case, causing the jaggies on my ultranet performance graph. The new one bcopies at a constant speed per byte, regardless of the length of the input. I installed a new libc.a, but did not install a new libc module due to its current state of confusion. 1339. Date: Tue, 03 Jul 90 14:39:11 PDT From: Fred Douglis <douglis> Subject: out of space wedging system my commands on sage are hanging. from an rlogin, i can't tell diddly. by running "cat /hosts/sage/dev/syslog" remotely i can tell that it's printing complaints about allspice out of space (on /mic) constantly. 1340. Date: Tue, 3 Jul 1990 22:32:56 PDT From: jhh@sprite.Berkeley.EDU (John H. 
Hartman) Subject: /dev Unless someone has a good reason why we shouldn't, I propose that all devices in /dev have a serverID of -1 (localhost). It is a pain in the #%?!& to trip over all the devices whose host isn't up. It takes hours to do an update of the root partition. Please don't make devices in /dev have a specific server. If a device needs a specific server then put it in /hosts/foo/dev. Don't make too many of these either. Thanks. 1341. Date: Wed, 4 Jul 90 12:18:48 PDT From: tve (Thorsten von Eicken) Subject: mint enetered debugger this morning I simply rebooted - everything went fine. Look at the console for details... 1342. Date: Wed, 4 Jul 90 12:49:43 PDT From: pmchen (Peter M. Chen) Subject: Re: mint enetered debugger this morning This morning I found mustard (ds3100 running 1.068) in the debugger (this has happened before, talk to Ken about it). I rebooted and it came back, but odd things were happening to my X server (it wouldn't start, couldn't find the DISPLAY variable) as well as other weird things like pwd going into the debugger. I rebooted several times. The last one caused mint to hang. This has happened before (twice) where rebooting mustard causes mint to hang, under kernels 1.066 and 1.067, I think. I've notified (or tried to notify in this case) spriters when mustard crashes in this way. Any advice for next time? It may be a hardware problem, so I could switch to garlic (another ds3100). 1343. Date: Wed, 04 Jul 90 23:50:41 PDT From: rab (Robert A. Bruce) Subject: oregano deadlocked Oregano deadlocked on the timerMutex. Timer: timerMutex @ 0xe0a61a0 Holder PC: 0x0 Current PC: 0x2eb00 Holder PCB @ 0xe0abaa4 Current PCB @ 0xe245a9c Error type 47 while syncing disks 1344. Date: Thu, 05 Jul 90 08:08:04 PDT From: rab (Robert A. Bruce) Subject: allspice Allspice crashed: Fatal Error: MachHandleWeirdoInstruction: The error occurred in a user process with procPtr = f650770 and pc = f602cfc4 It was running sun4.JHH.915 1345.
Date: Thu, 5 Jul 90 11:15:20 PDT From: mendel (Mendel Rosenblum) Subject: Newly installed tx broken The newly installed Tx goes into an infinite loop if you try to output a long line. The problem occurs if you try to build a Sprite kernel because make outputs the ld command as a long line. This happens on both the sun4c and ds3100. I backed out /sprite/cmds.sun4/tx to /sprite/cmds.sun4.old/tx. 1346. Date: Thu, 05 Jul 90 16:33:58 PDT From: Fred Douglis <douglis> Subject: ds3100 syslog reopen crashes system the ds3100 dev/fs recovery table had a bug and had no entry for the syslog device. thus, if a host has at any time opened /hosts/<ds3100>/dev/syslog, the poor ds3100 may crash each time it reboots and the first host recovers with it. i've changed this in the sources and it'll make it into the next kernel. i'm curious: should lint have turned this up? in any case, is the kernel due for a major de-linting session? i've been seeing lots of messages. 1347. Date: Thu, 05 Jul 90 17:20:41 PDT From: Fred Douglis <douglis> Subject: allspice crash in Fscache_DeleteFile same crash as bug report #29670. a more careful look showed that the list of blocksInCache was empty though the count of blocks was 1. 1348. Date: Thu, 5 Jul 90 21:15:33 PDT From: shirriff (Ken Shirriff) Subject: Allspice crash Allspice crashed this afternoon in List_Remove with invalid list pointers. The back trace was: List_Remove(&blockPtr->fileLinks) Delete_Block(blockPtr) Fetch_Block() Fscache_FetchBlock(); blockPtr seemed to be valid, but blockPtr->fileLinks was pointing to the wrong place. Maybe two processes updated fileLinks and it wasn't locked? 1349. Date: Fri, 6 Jul 90 09:38:11 PDT From: ouster (John Ousterhout) Subject: Corrupted files The following files were corrupted recently. Bob, can you restore these from tape? Perhaps we should push out a new kernel that has the fix for the file-corruption problem?
/mic/octtools/src/cmds/vem/schematic/ds3100.md/md.mk /mic/octtools/src/cmds/vem/symbolic/.newsrc /mic/octtools/src/cmds/TimberWolfMC/parser.c /sprite/lib/ps/tex.pro 1350. Date: Fri, 06 Jul 90 15:45:15 PDT From: Fred Douglis <douglis> Subject: migration trashed register/memory? cpp went into the debugger as a result of being migrated. it dereferenced a garbage pointer in code that looks like it couldn't have happened: struct file_name_list* ptr; for (ptr = dont_repeat_files; ptr; ptr = ptr->next) { if (!strcmp (ptr->fname, fname)) { close (f); return; /* This file was once'd. */ } } dont_repeat_files was NULL, so i don't see how it could have wound up with ptr==0x80940000. the obvious possibilities are that either the PC is getting messed up during migration, or registers aren't being saved right. since migration on sun4cs causes an error once in a blue moon (much less often than ds3100, which would happen nearly 100% of the time during compilations) i'm wondering if the machine-dependent state encapsulation has a minor bug. 1351. Date: Fri, 6 Jul 90 18:05:52 PDT From: shirriff (Ken Shirriff) Subject: Allspice crash Allspice had a new crash (right on schedule; Friday afternoon): Fatal Error: MachHandleWeirdoInstruction: unaligned address trap This was running sun4.JHH.915 FslclLookup was called from FslclOpen, to lookup "./..". Apparently FindComponent("..") returned success, but curHandlePtr->descPtr was NIL, so it died soon after when it tried to access curHandlePtr->descPtr->fileType. This is probably the same race that Sequent had and I fixed with them a few weeks ago. I just got mail from fubar with the changes they are currently using, so I'll integrate these with our sources. 1352. Date: Fri, 6 Jul 90 18:14:26 PDT From: shirriff (Ken Shirriff) Subject: strange processes Nutmeg was running very slow, even for a sun3. 
I did a ps and found the following processes:
root 60321 63.5 0.2 144 16 READY20008:07 login root
root 10319 0.0 0.0 72 0 WAIT 0:01 sh -c /c/stats/RAW
root 40318 0.0 0.0 72 0 WAIT 0:01 sh -c /c/stats/RAW
root 9031e 0.0 0.0 88 0 WAIT 0:01 loadavg
root 6031c 0.0 0.0 88 0 WAIT 0:01 loadavg
root 60322 0.0 0.0 72 0 WAIT 0:01 sh -c /c/stats/RAW
root d0312 0.0 0.0 88 0 WAIT 0:01 loadavg
root 313 0.0 0.0 72 0 WAIT 0:01 sh -c /c/stats/RAW
root 20320 0.0 0.0 72 0 WAIT 0:01 sh -c /c/stats/RAW
root 90336 0.0 0.0 72 0 WAIT 0:01 sh -c /c/stats/RAW
root 2033a 0.0 0.0 72 0 WAIT 0:00 sh -c /c/stats/RAW
root e033e 0.0 0.0 72 0 WAIT 0:01 sh -c /c/stats/RAW
root 10347 0.0 0.0 88 0 WAIT 0:01 loadavg
root 2034a 0.0 0.0 88 0 WAIT 0:00 loadavg
root 1034e 0.0 0.0 72 0 WAIT 0:00 sh -c /c/stats/RAW
root 1034f 0.0 0.0 72 0 WAIT 0:01 sh -c /c/stats/RAW
root b032a 0.0 0.0 72 0 WAIT 0:01 sh -c /c/stats/RAW
root f032e 0.0 0.0 88 0 WAIT 0:00 loadavg
root 20332 0.0 0.0 72 0 WAIT 0:01 sh -c /c/stats/RAW
Any idea what these are, and why login would soak up all this time? 1353. Date: Sat, 7 Jul 90 17:50:00 PDT From: mendel (Mendel Rosenblum) Subject: fscheck bug We (Mary, Ken, and I) have identified and corrected a problem that has caused repeated crashes of allspice. The problem was that /sprite/src/kernel/sig/.. didn't point to /sprite/src/kernel. It contained the file number of an unallocated file. This meant that anyone touching /sprite/src/kernel/sig caused allspice to panic. Unfortunately, fscheck didn't detect or repair the problem. I had to patch the directory in the file cache and write it back to allow allspice to be usable again. Bugs: 1) "/sprite/src/kernel/sig/.." got changed from file number 2 to the file number of an unallocated file. 2) Fscheck didn't detect the problem. 1354. Date: Sat, 7 Jul 90 18:59:34 PDT From: shirriff (Ken Shirriff) Subject: Re: fscheck bug -- restore bug? The restore program seems to have a serious bug in it that causes the directory structure to get messed up.
I restored /sprite/src/kernel/sig into /sprite/src/kernel/restore/sprite/src/kernel/sig. I used restore with the -r flag, which is supposed to preserve the current file and move the restored file elsewhere. (~shirriff/allspice.trace shows what I did.) However /sprite/src/kernel/sig now seems to be some kind of hard link to /sprite/src/kernel/restore/sprite/src/kernel/sig. If you do a "cd" to /sprite/src/kernel/sig, you end up in /sprite/src/kernel/restore/sprite/src/kernel/sig. I think this is what caused the problems with allspice the last couple days. Bob restored sig for me when I accidentally nuked part of it, I copied the restored things to /s/s/k/sig, and then I deleted /s/s/k/restore/*. Evidently /s/s/k/sig/.. then pointed to the nonexistent /s/s/k/restore/s/s/k and made /s/s/k/sig a poison directory. In summary: a) restore is evil b) /sprite/src/kernel/sig is messed up again (but not poison) and I don't know how to fix it. Maybe fscheck would do the right thing this time, but I don't want to try it. 1355. Date: Sun, 8 Jul 90 09:28:21 PDT From: ouster (John Ousterhout) Subject: Corrupted files? The checksummer reported corruption in the following files: /sprite/src/kernel/dbg.jhh/sun3.md/RCS/dbgMain.c,v /sprite/src/kernel/dbg/ds3100.md/dbgMain.o I checked the first file and it looks truncated rather than corrupted. Bob, can you restore it from tape? I assume that the second file can simply be recreated. 1356. Date: Mon, 09 Jul 90 13:19:29 PDT From: Fred Douglis <douglis> Subject: compilation/loader problem i'm running into trouble making "TM=cleansun3" for proc. this was done just as a test of pmake, except that it failed to load. now it turns out that remaking it from scratch doesn't succeed, even w/o migration (which is what i thought was fouling things up.) i get: ld: malformed input file (not rel or archive) cleansun3.md/procMach.o i can't see anything obvious wrong with the file. anyone have any suggestions what it might mean?
sun3 links fine, so it's cleansun3 that has the problem. procMach.c has no "#ifdef clean" lines in it, and making for cleansun3 does define sun3, so that shouldn't be the problem. 1357. Date: Mon, 9 Jul 1990 21:34:39 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: hard links to directories The creation/deletion of hard links to directories doesn't work very well. At the very least the link counts are not updated properly. I have a feeling that this feature is little used and little debugged. Someone should look over the code and make sure it looks ok. 1358. Date: Tue, 10 Jul 90 10:41:59 PDT From: Fred Douglis <douglis> Subject: lpd hoses again when, oh when, will lpd be reliable? i sent a 2-page document to our printer when it was already complaining about being out of paper from a previous job. when i added paper, it printed 2 copies of page 2 and tossed the job. 1359. Date: Tue, 10 Jul 90 18:33:59 PDT From: shirriff (Ken Shirriff) Subject: tar.gnu bug tar.gnu has a bug that causes it to dump directories as symbolic links, instead of directories. This is why directories end up with hard links on restore. This problem is in tar.gnu, but not tar. Basically, if you do a tar.gnu with the 'n' flag (for no recursion), it is supposed to dump the directory as a regular file. However, tar.gnu gets confused and ends up dumping it as a link. I'd like to discuss this with someone who understands tar.gnu (and why we have two tars), so I don't make a change that destroys all our dumps. Also, there's a bug in the file system that allows a regular user to make a hard link to a directory (you should be superuser to do this). I added a permission check to the fs code. 1360. Date: Wed, 11 Jul 90 11:18:30 PDT From: tve (Thorsten von Eicken) Subject: allspice has several sendmail in DEBUG doesn't accept ftp 1361. Date: Wed, 11 Jul 90 16:25:38 PDT From: shirriff (Ken Shirriff) Subject: Migration problem? 
I did a pmake on treason and got:

*** compat: Cannot decode user status value 0xc0303c78
MigOpenPdev: Error opening pdev /sprite/admin/migd/pdev (still trying): invalid argument.
MigOpenPdev: Succeeded in opening pdev.

and then it worked okay.

1362.
Date: Wed, 11 Jul 90 22:00:07 PDT
From: pmchen@ginger.Berkeley.EDU (Peter M. Chen)
Subject: ping to sprite sun4's

I did the following pings to allspice and sassafras (both sun4's) from ginger.
ping to allspice takes place in 0 time?
ping to sassafras doesn't respond at all, even though it responds fine when I ping it from sprite.
ping to mustard takes 19ms for the first one, then 0 time?

PING allspice.Berkeley.EDU (128.32.150.27): 56 data bytes
64 bytes from 128.32.150.27: icmp_seq=0. time=0. ms
64 bytes from 128.32.150.27: icmp_seq=8. time=0. ms

----allspice.Berkeley.EDU PING Statistics----
9 packets transmitted, 9 packets received, 0% packet loss
round-trip (ms) min/avg/max = 0/0/0
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
PING sassafras.Berkeley.EDU (128.32.150.41): 56 data bytes

----sassafras.Berkeley.EDU PING Statistics----
10 packets transmitted, 0 packets received, 100% packet loss
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
PING mustard.Berkeley.EDU (128.32.150.38): 56 data bytes
64 bytes from 128.32.150.38: icmp_seq=0. time=19. ms
64 bytes from 128.32.150.38: icmp_seq=5. time=0. ms

----mustard.Berkeley.EDU PING Statistics----
6 packets transmitted, 6 packets received, 0% packet loss
round-trip (ms) min/avg/max = 0/3/19

1363.
Date: Thu, 12 Jul 90 12:42:01 PDT
From: Fred Douglis <douglis>
Subject: "/" filled ... X0msgs and bootplog

both of these files were getting written repeatedly. someone's X server was printing "select failed" and bootp on mint was printing "recvfrom failed".

1364.
Date: Thu, 12 Jul 90 17:48:56 PDT
From: mendel (Mendel Rosenblum)
Subject: ps can't add when run remotely

If you run the ps command with the "-v" option from rsh you get incorrect memory size totals. For example:

jaywalk% ps -v
PID    CODSZ CODRS  HPSZ  HPRS STKSZ STKRS  SIZE   RSS  COMMAND
3120c   272   216   172   140    16    16   460   372  twm
71211   360   340   372   348    20    12   752   700  tx -D -title Console ...
51213   104     0    56     0     4     0   164     0  xinit
51214   704   444  1384   840    20    12  2108  1296  X :0
61216   360   340   404   380    20    12   784   732  tx =80x35+0+374 ...
41217   144   140    48    12    12     4   204   156  /users/mendel/.xinitrc ...
61218   144   140    80    72    24    20   248   232  /sprite/cmds/csh
61219   360   340   164   120    20     8   544   468  tx -title /dev/syslog ...
4121c   144   140    56    12    24     4   224   156  -csh
6121d   360   340   268   232    20    12   648   584  tx =80x35+576+374 ...
6121f    32    24    20     8     4     4    56    36  /sprite/cmds/cat ...
21220    40    32    24    24     4     4    68    60  /sprite/daemons/xgoned
11221   320   104   212    68    16     8   548   180  xbiff -geometry ...
11223   344   164   368   164    28    12   740   340  spritemon -geometry ...
11225   144   140    92    76    24    16   260   232  /sprite/cmds/csh -i
91228    64    60    36    24   136    56   236   140  ps -v
1122a   144   140   112   104    24    20   280   264  /sprite/cmds/csh -i
a122b   312   292   224   180    20     8   556   480  mx =80x36 ...
4122f   360   340   212   176    20    12   592   528  tx =80x37+0+378 -title ...
61230    80    72    60    12     8     4   148    88  rlogin rosemary
61231    80    72    60    16     8     4   148    92  rlogin rosemary
1123c   312   292   336   320    20    20   668   632  mx =80x36 ...
f1249   112   112    60    56    20    20   192   188  mail bugs
-----------------------------------------------------
Total  2888  2000  4820  3384   512   288  8220  5672
^^^^^^^^^^^^^^^^^^^^^^ Right

jaywalk% rsh jaywalk ps -v
PID    CODSZ CODRS  HPSZ  HPRS STKSZ STKRS  SIZE   RSS  COMMAND
3120c   272   216   172   140    16    16   460   372  twm
71211   360   340   372   348    20    12   752   700  tx -D -title Console ...
51213   104     0    56     0     4     0   164     0  xinit
51214   704   444  1384   840    20    12  2108  1296  X :0
61216   360   340   404   380    20    12   784   732  tx =80x35+0+374 ...
41217   144   140    48    12    12     4   204   156  /users/mendel/.xinitrc ...
61218   144   140    80    72    24    20   248   232  /sprite/cmds/csh
61219   360   340   164   120    20     8   544   468  tx -title /dev/syslog ...
4121c   144   140    56    12    24     4   224   156  -csh
6121d   360   340   272   236    20    12   652   588  tx =80x35+576+374 ...
6121f    32    24    20     8     4     4    56    36  /sprite/cmds/cat ...
21220    40    32    24    24     4     4    68    60  /sprite/daemons/xgoned
11221   320   104   212    68    16     8   548   180  xbiff -geometry ...
11223   344   164   368   164    28    12   740   340  spritemon -geometry ...
11225   144   140    92    76    24    16   260   232  /sprite/cmds/csh -i
d1229    72    64    48    48    12    12   132   124  rsh jaywalk ps -v
1122a   144   140   112   104    24    20   280   264  /sprite/cmds/csh -i
a122b   312   292   224   180    20     8   556   480  mx =80x36 ...
1122e   144   140    44    40    12    12   200   192  csh -c ps -v
4122f   360   340   212   176    20    12   592   528  tx =80x37+0+378 -title ...
61230    80    72    60    12     8     4   148    88  rlogin rosemary
61231    80    72    60    16     8     4   148    92  rlogin rosemary
91232    72    64    48    48    12    12   132   124  rsh jaywalk ps -v
e1233    64    60    36    24   140    60   240   144  ps -v
1123c   312   292   336   320    20    20   668   632  mx =80x36 ...
f1249   112   112    60    56    20    20   192   188  mail bugs
-----------------------------------------------------
Total     0     0     0     0     0     0     0     0
^^^^^^^^^^^^^^^^^^^^^^ Wrong
jaywalk%

Very weird!!

1365.
Date: Thu, 12 Jul 90 23:37:04 PDT
From: shirriff (Ken Shirriff)
Subject: Interesting ds3100 crash

I had a program that did repeated illegal instructions, so it printed "Reserved instruction in process..." to the syslog a whole bunch. Eventually the syslog buffer overflowed, so at the bottom of the screen it printed: "Dev_SyslogWrite: Buffer", and then hung. It doesn't respond to L1-D or L1-A.

1366.
Date: Fri, 13 Jul 90 10:55:45 PDT
From: ouster (John Ousterhout)
Subject: Mustard hanging migrations?

I tried running pmakes twice this morning on Piracy, and both times they hung up in un-killable states. At about the same time in each case, a message "RpcDoCall (mig command) RPC to must is hung" appeared on the syslog. Rup lists mustard as up, but ping and other network commands don't work to it.

1367.
Date: Fri, 13 Jul 90 15:32:59 PDT
From: eklee (Edward K. Lee)
Subject: pmake does not properly expand shell metasymbols

--- Makefile ---
{a,b,c}.o :
        echo %@
---
forgery% pmake
--- a.o ---
echo a.o
a.o
forgery%

According to the pmake documentation, {a,b,c}.o should be expanded to a.o, b.o, c.o.

1368.
Date: Fri, 13 Jul 90 22:35:26 PDT
From: kupfer (Mike Kupfer)
Subject: sage% man chpass

I did "man chpass" and got

sage% man chpass
Reformatting manual entry. Please wait...
sage%

ls shows a new zero-length file.

1369.
Date: Fri, 13 Jul 90 23:16:41 PDT
From: Mike Kupfer <kupfer>
Subject: can't change shell

I dunno if this is a bug or an administration glitch. At any rate, I can't seem to change my shell (to tcsh). chpass complains that it doesn't know who I am.

sage% chpass kupfer
chpass: unknown user kupfer.

1370.
Date: Fri, 13 Jul 90 19:15:30 PDT
From: bsw!adam@uunet.UU.NET (Adam de Boor)
Subject: pmake does not properly expand shell metasymbols

You are right in saying that "{a,b,c}.o" expands to "a.o b.o c.o" but they are not treated as a single target. Rather, you've created three separate targets "a.o", "b.o" and "c.o" each of which can be remade by the command

        echo %@

where %@ is either "a.o", "b.o" or "c.o". Nowhere in the documentation does it imply that "{a,b,c}.o" creates a special group target...

1371.
Date: Sat, 14 Jul 90 00:22:11 PDT
From: elm (ethan miller)
Subject: tcsh problems

On the sun4c, I have problems logging in with tcsh occasionally. This only happens on the sun4c (as far as I know, it has never even happened on the sun4). I get a MachWindowUnderflow (or Overflow, I don't remember which). I can reproduce this bug pretty regularly, though it helps if the machine I'm logging into hasn't been using tcsh in a while. tcsh crashes sometime between the last command in my .login file and presenting the prompt.

Any ideas what is causing this? It has been happening for quite some time (several months). I've mentioned it before, but nothing was done.

1372.
Date: Sat, 14 Jul 90 00:25:39 PDT
From: elm (ethan miller)
Subject: more on tcsh bug

To aid in debugging, I've left a copy of tcsh in the debugger on joyride. Its process ID is 24a22. I tried changing the last command in my .login from an uptime command to echo "". Same thing happened (the login shell died just before the prompt).

1373.
Date: Sat, 14 Jul 90 15:44:57 PDT
From: eklee (Edward K. Lee)
Subject: Re: pmake does not properly expand shell metasymbols

>From bsw!adam@uunet.UU.NET Sat Jul 14 00:06:33 1990
Date: Fri, 13 Jul 90 19:15:30 PDT
From: bsw!adam@uunet.UU.NET (Adam de Boor)
To: eklee@sprite.Berkeley.EDU
Cc: bugs@sprite.Berkeley.EDU
In-Reply-To: Edward K. Lee's message of Fri, 13 Jul 90 15:32:59 PDT <9007132232.AA928536@sprite.Berkeley.EDU>
Subject: pmake does not properly expand shell metasymbols

>You are right in saying that "{a,b,c}.o" expands to "a.o b.o c.o" but
>they are not treated as a single target. Rather, you've created three
>separate targets "a.o", "b.o" and "c.o" each of which can be remade by
>the command
>
> echo %@
>
>where %@ is either "a.o", "b.o" or "c.o". Nowhere in the documentation
>does it imply that "{a,b,c}.o" creates a special group target...

I realize that. I was trying to make each individually. I assumed incorrectly that a Makefile of the form:

a.o b.o c.o:
        echo %@

would cause the first dependency to be made but make actually only makes the first target. What I meant was:

all:a.o b.o c.o
a.o b.o c.o:
        echo %@

1374.
Date: Mon, 16 Jul 90 11:04:07 PDT
From: ouster (John Ousterhout)
Subject: Bad morning for Allspice

When I came in at 8:20 this morning, Allspice was dead. It took until about 10:45 to get Sprite back up again. Here's the sequence of events that occurred:

1. When I arrived, Allspice was in the debugger with a Level 15 Interrupt. In addition, Mint was out of disk space in "/". I don't know if the two events could be related.

2.
The reason that "/" was full was the ip.out file for burble, which had grown to 1.5 Mbytes. I recall messages from Thorsten a while ago about changing the ipServer on some machines; could this have generated the huge file? Thorsten, could you look into this and make sure that Burble's ip.out file won't go crazy again? Finding the culprit took a long time ("du" spent about 20 minutes printing timeout messages for devices on machines that aren't running Sprite).

3. As part of freeing up space I had to enter 440 Evans and manually reboot burble (the ipServer still had the file open so its space didn't free up; I couldn't rlogin to get the pid for the process; and attempts to reboot it remotely had no effect). I couldn't tell which machine in 440 was burble, so I rebooted crackle too.

4. While freeing space on "/" I rebooted Allspice. The first two times were with the JHH kernel advertised cryptically on a loose sheet of paper on top of Allspice's console. At a very early point in rebooting (before starting to check disks) Allspice crashed with another Level 15 Interrupt. Note that I hadn't yet eliminated the space problem on "/" when the second crash occurred.

5. The next time I rebooted Allspice was after space had been freed on "/". This time it checked all the disks, but at one point along the way printed a message about a write error on one of the disks. After checking the disks, Allspice didn't export any of them, but bailed to a shell (no noticeable message about why).

6. At this point I began to suspect the JHH kernel, so I rebooted with "sun4.new". This kernel crashed during the disk check with a message about a bad kernel page fault. The pc where the bad reference was made was 0xf6034d34, and the bad reference was to address 0x18c.

7. I rebooted Allspice again (sun4.new) and the same error occurred at the same place. At this point I decided that sun4.new was bad.

8.
I rebooted Allspice again, with the JHH kernel, and this time it got all the way through booting. 1375. Date: Mon, 16 Jul 1990 11:19:13 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: Re: Bad morning for Allspice The JHH kernel was the correct one to boot. There is a message to that effect on the console. A write error to a disk is considered a hard error. The response for a hard error is to bail to a shell, so that someone may fix the problem. 1376. Date: Mon, 16 Jul 90 11:53:20 PDT From: shirriff (Ken Shirriff) Subject: Mustard crash Mustard crashed but the debugger kept timing out and resending. It crashed with a TLB Load/Store at 0x800884fc running 1.068. Debugging I got: Bus error [Net_NetToHostShort.Net_NetToHostShort:53 ,0x800884fc] and then it kept timing out and resending. The code is: >*[Net_NetToHostShort:53, 0x800884fc] lbu r15,6(sp) and the stack pointer is 0xc100bce0. This code is in the startup code for Net_NetToHostShort to save registers. Two questions: a) Why did this instruction crash? (The surrounding instructions are similar). b) Why did the debugger time out whenever I tried examining anything? 1377. Date: Mon, 16 Jul 90 21:10:26 PDT From: pmchen (Peter M. Chen) Subject: rup rup seems to give erroneous information. sage is down now, but rup lists it as up. Also, allspice is up, but rup lists it as down. This hangs migrations which are unlucky enough to migrate to a host which is down but listed as up. 1378. Date: Tue, 17 Jul 90 10:14:34 PDT From: shirriff (Ken Shirriff) Subject: Kernel install howto /sprite/admin/howto says: > 3) *for each machine type*, recompile all the kernel directories: > > % pmake TM=... clean > % pmake TM=... all But if I do this, 'pmake TM=ds3100 all', for example, ends up compiling for all machine types. This means a) I end up compiling sun versions on the ds3100; b) everything ends up compiled 4 times. So what is the proper incantation to compile the kernel modules? 1379. 
Date: Tue, 17 Jul 90 10:27:53 PDT From: mgbaker (Mary Gray Baker) Subject: Re: Kernel install howto Hmmm... Maybe Fred just meant one compile for each machine type. That's what I do. I spread out the work by logging into some sun4c and doing the sun4c compiles and maybe the sun4 or sun3 compiles. I only do the ds3100 compiles on a decstation, since I usually get so many migration errors on the decstations that it's not worth it to compile the other machine types there. 1380. Date: Tue, 17 Jul 90 12:13:08 PDT From: pmchen (Peter M. Chen) Subject: last allspice "crash" Right before allspice hung this morning (around noon), I had started a pmake which spawned off several (5, I think) simulations, each of which uses 10-20MB of memory. The pmake was started from mercenary (sun4c). I'm not sure why this would hang allspice, unless the programs were thrashing or /swap1 filled up. But there were still 53 MB free on /swap1 when I looked (after the simulations were running). And 10-20 MB should have mostly fit in a sparcstation's memory (I have a ps -v output which says they were all mostly memory resident). 1381. Date: Tue, 17 Jul 90 14:35:50 PDT From: shirriff (Ken Shirriff) Subject: Net_InstallRoute With my new kernel, I get: Warning: Net_InstallRoute: bad name arg Initsprite: boot command file exited abnormally on the sun3. Before I investigate, does anyone know why this would happen? Bonus question: what's the easiest way to check inside the kernel if an address is a valid kernel address? I.e. I'm given a pointer and I want to know if I can read the address without a bus error. 1382. Date: Tue, 17 Jul 90 13:35:07 PDT From: mendel (Mendel Rosenblum) Subject: Re: last allspice "crash" >Right before allspice hung this morning (around noon), I had started >a pmake which spawned off several (5, I think) simulations, each of >which uses 10-20MB of memory. The pmake was started from mercenary >(sun4c). 
>
>I'm not sure why this would hang allspice, unless the programs were
>thrashing or /swap1 filled up. But there were still 53 MB free on
>/swap1 when I looked (after the simulations were running). And
>10-20 MB should have mostly fit in a sparcstation's memory
>(I have a ps -v output which says they were all mostly memory resident).
>
>Pete
>

There are several problems here. One is that 5*20MB is more swap space than is usually available. Migration works by swapping out the program and swapping it back in on the new machine. This means that swap space gets allocated for these processes during evict and doesn't get freed until the process exits. Last night /swap1 filled because of this reason.

Also the current client swapping code pounds allspice into the ground. Evicting a 20Meg process will cause allspice to become unusable for many (< 3 ) minutes. So /swap1 fills, killing other processes that try to allocate space, and allspice hangs or times out.

In addition, there are several sun4c running Sprite with 16Megs of memory. If a 20Meg job gets on one of these machines allspice gets pounded. This happened with burble last night. It appears that one thrashing sun4c or ds3100 can drive allspice catatonic.

1383.
Date: Tue, 17 Jul 90 14:43:40 PDT
From: shirriff (Ken Shirriff)
Subject: Re: adduser script

I changed adduser to be world-executable so Bob can run it. I presume we don't want to leave it this way, but should have Bob and adduser in the same group.

1384.
Date: Tue, 17 Jul 1990 15:12:34 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: sprite labels obsolete

We had a potentially very serious situation this afternoon in which assault could not attach 4 of its 5 disks. The bottom line is that those disks had Sprite labels, and Sprite labels are no longer supported. I was able to find an old copy of labeldisk that understood Sprite labels so they now have Dec labels.
This mishap was my fault since I removed the support for Sprite labels before actually verifying that there weren't any disks that used them. I assumed that they had been changed back when Dec labels became available. 1385. Date: Tue, 17 Jul 90 15:29:57 PDT From: elm (ethan miller) Subject: xload troubles On my sparcstation (terrorism), xload has been dying a lot recently. It doesn't seem to be the normal sort of thing (Fred playing with the migration daemon). Any ideas? 1386. Date: Tue, 17 Jul 90 15:57:29 PDT From: mgbaker (Mary Gray Baker) Subject: More on tcsh problem I'm relaying more information on the tcsh/sun4c problem from Ethan. He thinks the problem may have something to do with the savehist feature, since when it dies it always dies in a routine dealing with the history file. There's something going on with a sighold on SIG_INT and a worrisome comment in the code about getting any signals there. Also, the program seems to die less frequently if Ethan keeps a shorter history. 1387. Date: Tue, 17 Jul 1990 16:44:39 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: bug in rpc system A 'chmod' of a remote device that does not exist doesn't work. On the local host I get the following message: <setIOAttr> 7/17/90 16:39:30 kvetching (2) RPC timed-out Fsrmt_SetIOAttr failed <30002>: device <2,1> at server 2 On the remote machine I get : RpcResend: RPC 23, client 49, RPC seq # 19ae9, forgot reply? RpcResend: RPC 23, client 49, RPC seq # 19ae9, forgot reply? RpcResend: RPC 23, client 49, RPC seq # 19ae9, forgot reply? RpcResend: RPC 23, client 49, RPC seq # 19ae9, forgot reply? RpcResend: RPC 23, client 49, RPC seq # 19ae9, forgot reply? RpcResend: RPC 23, client 49, RPC seq # 19ae9, forgot reply? RpcResend: RPC 23, client 49, RPC seq # 19ae9, forgot reply? All of this takes several seconds to happen, and 'update' does it twice for each device. This causes an 'update' of '/' to take approximately forever. 1388. 
Date: Wed, 18 Jul 90 12:07:16 PDT From: pmchen (Peter M. Chen) Subject: LE ethernet: Memory underflow error. I get tons of these messages when I run big jobs on one machine which have big swap images. LE ethernet: Memory underflow error. LE ethernet: Reinitialized chip. Also, LE ethernet: Received packet with overflow error. This has happened on more than one machine. 1389. Date: Wed, 18 Jul 90 15:25:30 PDT From: shirriff (Ken Shirriff) Subject: pmake installhdrsall problem pmake installhdrsall chokes on symbolic links. It keeps telling me: Type of "sun4c.md/timerTick.h" (regular file) differs from "../Include/sun4c.md/ timerTick.h" (symbolic link). Note: following a link to a regular file due to "-l" option. If I delete Include/sun4c.md/timerTick.h it then installs properly, and then I can start again. 1390. Date: Wed, 18 Jul 90 16:53:19 PDT From: shirriff (Ken Shirriff) Subject: mkmf bug? I can't convince pmake installhdrs to install vm/sun4c.md/vmMachInt.h. It installs other vm/sun4c.md headers fine. After doing a mkmf, sun4c.md/md.mk contains: HDRS = ... sun4c.md/vmMachInt.h ... MDPUBHRS = ... with no sun4c.md/vmMachInt.h If I put sun4c.md/vmMachInt.h in MDPUBHDRS the header gets installed. So why doesn't mkmf put sun4c.md/vmMachInt.h in the MDPUBHDRS like it does with sun4c.md/vmMach.h, for instance? 1391. Date: Wed, 18 Jul 90 20:23:08 PDT From: ouster (John Ousterhout) Subject: Re: mkmf bug? You said: I can't convince pmake installhdrs to install vm/sun4c.md/vmMachInt.h. It installs other vm/sun4c.md headers fine. After doing a mkmf, sun4c.md/md.mk contains: HDRS = ... sun4c.md/vmMachInt.h ... MDPUBHRS = ... with no sun4c.md/vmMachInt.h If I put sun4c.md/vmMachInt.h in MDPUBHDRS the header gets installed. So why doesn't mkmf put sun4c.md/vmMachInt.h in the MDPUBHDRS like it does with sun4c.md/vmMach.h, for instance? I believe that mkmf has rules about which files are considered "public" and which are "private", and that the rules are based on the file's name. 
As I remember, a header file is considered private if either (a) its name doesn't start with the module prefix, or (b) its name ends in "Int.h". I wouldn't think vmMachInt.h should be getting installed: why should anyone outside vm need to access it?

1392.
Date: Thu, 19 Jul 90 09:13:10 PDT
From: mendel (Mendel Rosenblum)
Subject: Re: LE ethernet: Memory underflow error.

This is due to a problem with the sparcStation hardware. The DMA controller does not meet the minimum latency requirements of the LANCE ethernet chip. The LANCE has a minimum latency requirement of 3.74 microseconds. If the CPU is doing other memory-intensive operations such as bcopy or cache flushing, the LANCE chip over/underruns its fifo and reports an error. We (Sprite group) can't remove this problem but we can remove the printf that reports it to the syslog.

1393.
Date: Thu, 19 Jul 90 13:22:49 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: weekly dump failed

I tried to do a weekly dump last night but it didn't work. I kept getting IO errors in allspice's tape drive. This happened on two new tapes. Murder ran for a little while, then just hung. The daily dumps were not done, but I'm doing them right now. I plan on doing daily dumps until Bob gets back and he figures out what is wrong with the weekly dumps.

1394.
Date: Thu, 19 Jul 90 16:45:46 PDT
From: shirriff (Ken Shirriff)
Subject: Install chokes on symbolic links

In the kernel install, symbolic links are being installed as the actual files in the Installed directory. This is bad because mach/sun4.md/sun4/fpu points to .., which means the install ends up in an infinite loop following the links.

1395.
Date: Sat, 21 Jul 90 17:28:01 PDT
From: tve (Thorsten von Eicken)
Subject: /sprite/cmds.sun4/mkmf corrupted

A mail message replaced its end. Ironically, the mail was about mkmf...

1396.
Date: Sat, 21 Jul 90 17:29:32 PDT From: tve (Thorsten von Eicken) Subject: Re: /sprite/cmds.sun4/mkmf corrupted I forgot to mention that I moved the bad file to /sprite/cmds.sun4/mkmf.bad 1397. Date: Mon, 23 Jul 90 08:56:36 PDT From: ouster (John Ousterhout) Subject: Corrupted files The following files were found to be corrupted over the weekend. Bob, can you restore them from dump tape? /sprite/lib/ds3100.md/term/tab450 /sprite/lib/fonts/pk/ilcmssb8.120pk I checked the boottimes file, and at the time of the corruptions Mint was running a kernel that does not have the bug fix in it (whew). At present, the "new" kernels *do* have the bug fix. Mint is now running this kernel, but I'm not sure what version Allspice is running (it's a JHH kernel). 1398. Date: Mon, 23 Jul 90 08:57:02 PDT From: ouster (John Ousterhout) Subject: Mint was dead when I came in this morning. It had run out of memory. I rebooted it with the .new kernel. 1399. Date: Mon, 23 Jul 90 19:06:09 PDT From: shirriff (Ken Shirriff) Subject: Allspice in debuggger Allspice went into the debugger with a Level 15 interrupt (31) at 0xf6056938 for no apparent reason, so I continued it. Anyone know why this happened? 1400. Date: Mon, 23 Jul 90 23:51:51 PDT From: root (The Sprite God) Subject: /user2 is unavailable I rebooted allspice and assault with the new kernels. Allspice came back up without any problems, but assault was unable to attach /user2. Fscheck gets a read error when it tries to read a sector in Disk_ReadSummaryInfo(). I made several attempts to boot the new kernel, and tried the old kernel too. But none of them were able to attach /user2. 1401. Date: Tue, 24 Jul 90 14:17:30 PDT From: shirriff (Ken Shirriff) Subject: Mint crash (whining) Mint crashed because /sprite filled up and it did: Fscache_Write: Alloc failed <1,1> "234" DISK FULL Fatal Error: Fscache_DeleteFile failed "43" blocks 1 flags 2800 Mendel says this is a known bug (thus this is whining). 1402. 
Date: Tue, 24 Jul 90 15:39:05 PDT From: Fred Douglis <douglis> Subject: migd and symmetry the migration daemon has apparently been unstable since blackmail has been running sprite. i think this is because blackmail's migd is somehow incompatible with the migd everyone else is running, which is causing two migration daemons to run in parallel despite the locking that normally inhibits this. the date of /sprite/daemons.sym/migd is yesterday. was this compiled from our sources? 1403. Date: Tue, 24 Jul 90 17:36:41 PDT From: shirriff (Ken Shirriff) Subject: Migration problem (whining) My pmakes wedge up with: MigOpenPdev: Error opening pdev /sprite/admin/migd/pdev (still trying): I/O error. It then takes a minute or so before I can get it to quit. 1404. Date: Tue, 24 Jul 90 17:40:36 PDT From: Fred Douglis <douglis> Subject: Re: Migration problem (whining) this is because the migd daemon, running on two ds3100s this afternoon, has died with bogus addresses each time. normally this would cause it to exit, but for some reason it's staying in the debugger and hanging rpc's to it. i'm installing a new ds3100 migd binary in the hope that the one that was installed accidentally had debugging enabled (thus disabling the code to kill itself on SIGDEBUG). this is likely, since it was also installed incorrectly (not setuid). i must have fouled something up somewhere along the line. as for why it's hitting the debugger in the first place... i'm investigating. 1405. Date: Tue, 24 Jul 90 22:35:52 PDT From: shirriff (Ken Shirriff) Subject: exec = -1 crash It looks to me that the problem with the exec = -1 crash is that mustard is sending a FS_CLOSE rpc to mint with the FS_EXECUTE flag, when the file wasn't open for execution. The client checks its use counters, so mustard must think the file is open for execution when it's not. Maybe mustard is getting its stream pointers messed up? I don't know why this would happen. 
In order to track this down, I recommend booting mustard with an instrumented kernel late at night to determine what mustard thinks it's doing. As well, we could put a sanity check in the server side of the rpc, so the rpc would fail if the FS_EXECUTE flag is set when it shouldn't be (instead of panicking).

1406.
Date: Wed, 25 Jul 90 09:54:45 PDT
From: ouster (John Ousterhout)
Subject: Trashed Directory

The directory /sprite/src/lib/tcl/tests suddenly lost its directory property today: it's now just a regular file (however, it still seems to contain the same bits it had when it was a directory). I've moved it to /sprite/src/lib/tcl/tests.bad, in case anyone cares to look at it, but I'm not sure what can be done at this point.

1407.
Date: Wed, 25 Jul 1990 10:45:23 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: Re: Trashed Directory

It looks like when fscheck tried to read the directory the read failed. This failure propagated up a few levels in fscheck, where it decided that the directory was empty, so it changed it to a file. All of the files that used to be in the directory should be in lost+found (fscheck put 14 there).

1408.
Date: Wed, 25 Jul 90 11:02:25 PDT
From: Fred Douglis <douglis>
Subject: ds3100 vm tlb bug

piquante died during a migration with the following backtrace:

>  0 panic(va_alist = -2146451376) ["sysPrintf.c":209, 0x800c13f4]
   1 .block603 ["ds3100.md/vmPmax.c":1199, 0x800c6b18]
   2 VmMach_PageValidate(virtAddrPtr = 0xc11abbcc, pte = 3252687542) ["ds3100.md/vmPmax.c":1199, 0x800c6b18]
   3 VmPageValidateInt(virtAddrPtr = 0x80161400, ptePtr = (nil)) ["vmPage.c":652, 0x800cbd7c]
   4 VmPageValidate(virtAddrPtr = 0xc11abbcc) ["vmPage.c":623, 0x800cbd08]
   5 VmCOWCopySeg(segPtr = 0x80190fac) ["vmCOW.c":322, 0x800c88c8]
   6 Vm_InitiateMigration(procPtr = 0xc02ad91c, hostID = 2, infoPtr = 0xc11abd10) ["vmMigrate.c":96, 0x800ca870]
   7 Proc_MigrateTrap(procPtr = 0xc02ad91c) ["procMigrate.c":564, 0x800a18f0]
   8 Sig_Handle(procPtr = 0xfc34, sigStackPtr = 0xc11abe3c, pcPtr = 0xc11abe38) ["signals.c":1205, 0x800bb330]
   9 .block15 ["ds3100.md/machCode.c":1275, 0x800382a4]
  10 MachUserReturn(procPtr = 0xc02ad91c) ["ds3100.md/machCode.c":1275, 0x800382a4]
  11 MachSysCall(0xffffffff, 0x43709c, 0x7ddfa6f4, 0x43709c, 0xfc0c) ["ds3100.md/machAsm.s":1531, 0x80036ac4]

i'm confused because VmMach_PageValidate does the following:

    retVal = VmMachWriteTLB(lowEntry, highEntry);
    if (retVal >= 0 && !(machPtr->modPage == virtPage)) {
        panic("VmMach_PageValidate: Non-modified user TLB entry found\n");
    }

and retVal is supposedly < 0 (it's -1071923224 == 0xc01bbfe8), so i don't see why the panic would be reached. [also, the code for VmMachWriteTLB says it doesn't return a value, though it implicitly returns a value by setting v0. i'll change that.]

by the way, machPtr->modPage is 0 while virtPage is 65536.

1409.
Date: Tue, 24 Jul 90 13:22:18 PDT
From: shirriff (Ken Shirriff)
Subject: Crashes this morning

The crashes this morning seem to be due to mustard being a poison machine again. Mint would keep crashing after mustard was rebooted, with the use.exec = -1 problem.
I told Pete to leave mustard dead until we figure out the problem. I'm hoping my debugging statements in the new kernel will let me figure out the problem. Otherwise I'll reboot mustard late at night and continue debugging. 1410. Date: Wed, 25 Jul 90 14:32:28 PDT From: Fred Douglis <douglis> Subject: spritemon won't compile (whining) it includes X11/Load.h, which no longer exists. it looks like perhaps the spritemon in /X11/R4/src/cmds/spritemon is actually an R3 version? spritemon doesn't properly report the number of remote processes due to a change to the kernel statistics buffer. 1411. Date: Wed, 25 Jul 90 14:45:57 PDT From: mendel (Mendel Rosenblum) Subject: Re: spritemon won't compile (whining) > it includes X11/Load.h, which no longer exists. it looks like perhaps > the spritemon in /X11/R4/src/cmds/spritemon is actually an R3 version? Someone removed the R3 header file X11/Load.h. Spritemon quit using this file when I converted it to R4 but I forgot to delete it from the include list. It's gone now. 1412. Date: Wed, 25 Jul 90 21:57:35 PDT From: shirriff (Ken Shirriff) Subject: Allspice crash (whining) Allspice crashed this evening and wouldn't respond to the keyboard or the reset button, so I power-cycled it. Several minutes before the crash it had consistency timeouts with blackjack. 1413. Date: Thu, 26 Jul 90 14:27:10 PDT From: Fred Douglis <douglis> Subject: inconsistent file length in trying to check to see what version of the kernel everyone was running, i came across a bad /hosts/blackmail/boottime file. however, it's only bad on machines other than blackmail. kvetching & larceny report the length of the file as 157, with contents Wed Jul 25 21:48:55 PDT 1990 blackmail SPRITE VERSION 1.165 (sym 17 Jul 90 11:21:00) :55:09 PDT 1990 blackmail SPRITE VERSION 1.165 (sym 17 Jul 90 11:21:00) while blackmail reports it as 85 with only the first line in the file. 1414. 
Date: Thu, 26 Jul 90 14:38:36 PDT From: mendel (Mendel Rosenblum) Subject: Re: inconsistent file length > in trying to check to see what version of the kernel everyone was > running, i came across a bad /hosts/blackmail/boottime file. however, it's > only bad on machines other than blackmail. kvetching & larceny report > the length of the file as 157, with contents Here's a wild guess at the problem. Blackmail is currently running with a format type that is not known by the rest of the Sprite system. This means that ioctl between machines (such as the truncate ioctl) won't work. Perhaps blackmail tried to truncate the file and it worked locally but not on the file servers. This demonstrates the main disadvantage of receiver-makes-right protocols. A new system can't talk until everyone has been updated to understand its language. 1415. Date: Thu, 26 Jul 90 14:43:04 PDT From: Fred Douglis <douglis> Subject: Re: inconsistent file length yes, that makes sense, since i already mentioned to fubar that mint & allspice had lots of "invalid format" messages in their syslogs. the same problem occurred with migd, which had to be relinked with a new c library that understood the sequent format. 1416. Date: Thu, 26 Jul 90 17:20:27 PDT From: culler (David Culler) Subject: Ehhh? Latex gets a segmentation fault running on cardamom, but not on gluttony. 1417. Date: Thu, 26 Jul 90 17:22:04 PDT From: Fred Douglis <douglis> Subject: bitrot (whining) whatever happened to the idea of recompiling the world every so often to make sure things were consistent? it turns out that the "rcssnapshot" program was last installed for the sun4 over a year ago, and it didn't have the fixes to permit it to operate on directories that are stored on a file server of a different byte order. 
surprisingly, there was still an unstripped executable image for the program, so i tried debugging that version to see what was wrong -- meaning i wasted time trying to debug rcssnapshot when it turned out only a reinstall was necessary. i suppose the simple answer is that everything will be relinked once the new unix-compatible calls are in place. i'm still surprised that we could go a year without recompiling system programs. 1418. Date: Sat, 28 Jul 90 10:01:36 PDT From: ouster (John Ousterhout) Subject: adduser and sprite-users (whining) The "adduser" script always adds the new user into the "sprite-users" mailing list. Shouldn't it be modified to allow a choice of which mailing list to put the person in? 1419. Date: Sun, 29 Jul 90 22:04:13 PDT From: elm (ethan miller) Subject: problems with lpd on sun4c (whining) I was unable to print from terrorism, joyride, or sage to the Postscript printer (ps) on the 5th floor. Every time I tried, the message came back "ginger.Berkeley.EDU: /usr/lib/lpd: : Your host does not have line printer access" So why is this a Sprite bug? I was able to print from allspice, and I was able to check the queue from heresy. Allspice is entered in /etc/hosts.equiv on ginger, and heresy is not. Terrorism and sage are entered, and joyride isn't, so this doesn't seem to be a problem I had earlier with printing from Sprite (then again, it might be). 1420. Date: Mon, 30 Jul 90 08:40:40 PDT From: ouster (John Ousterhout) Subject: Mint crash (whining) Mint was dead when I came in this morning. It had run out of memory. I rebooted it with "sun3.new". This is the 2nd Monday morning in a row where this has happened; it makes me think that (a) there's a core leak in the kernel, and (b) there might be some activity happening every Sunday night that is triggering the problem. Both times the crash has occurred while the checksummer has been running. However, when the crash occurred the checksummer was running on Allspice, not Mint. 1421. 
Date: Mon, 30 Jul 90 12:57:18 PDT From: pmchen (Peter M. Chen) Subject: talk incompatibility (whining) I can't use the "talk" program between unix machines and sprite. I get [Error on read from talk daemon : connection refused (61)] From unix to sprite it gets hung on [Checking for invitation on caller's machine] Talk still works between sprite machines. Date: Mon, 30 Jul 90 13:05:27 PDT From: Fred Douglis <douglis> Subject: Re: talk incompatibility (whining) pete's mail was in response to my asking him this question. the problem isn't sprite, per se. sprite supports only ntalkd, while sunOS supports only the old talk. BSD has both talk and /usr/old/talk. i guess we could support both as well if someone cared to port it. 1422. Date: Mon, 30 Jul 90 17:35:07 PDT From: Fred Douglis <douglis> Subject: arson,violence arson bailed to a shell trying to boot my kernel (made from all uninstalled sources earlier today). did it need to be booted with something special, like "--c"? last time i looked, arson didn't have a disk, and i didn't see a note about how to boot it. violence didn't boot at all. i followed the instructions on the post-it note saying to boot "--c". i couldn't tell why it wasn't booting since the monitor is fried. 1423. Date: Mon, 30 Jul 90 17:50:18 PDT From: shirriff (Ken Shirriff) Subject: Allspice, assault crashes (whining) Allspice died with disk full, delete failed. Assault died with Vm_RawAlloc: out of memory. 1424. Date: Mon, 30 Jul 90 16:53:39 PDT From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: login syslog message and typeahead (whining) the new version of login prints "login failure" when one name is typed and then another one is. not only is this annoying if, for example, one mistypes an account name, but it also causes echoing to be disabled slightly later than expected --- meaning the start of one's password may be echoed inadvertently. 1425. 
Date: Tue, 31 Jul 90 10:42:12 PDT From: Fred Douglis <douglis> Subject: migd/sun4 cache writeback error bug (whining) no one ever reported the bug that mendel noticed several days ago, in which the global migration daemon crashes sun4s upon exit if someone performs an operation on one of its pdevs at the wrong time. it's present in the current .new kernel and will apparently be present in the next .new kernel, so better hope migd doesn't pick allspice (or anise) to run the global daemon on. 1426. Date: Tue, 31 Jul 90 11:44:33 PDT From: shirriff (Ken Shirriff) Subject: rup is confused (whining) rup gives: [...] nutmeg sun3 down 1+08:45 I'm on nutmeg and it's been up since the 27th. 1427. Date: Tue, 31 Jul 90 11:49:45 PDT From: Fred Douglis <douglis> Subject: Re: rup is confused (whining) sounds from the time it "crashed" -- 3am monday -- that it's related to the overnight file server crash. migd is pretty resilient, but i'm not at all surprised that the crash of a server would cause migd to flake out once every so often. i guess we could run a migd watchdog, like we do for the ipServer, but it doesn't seem worth it unless this presents a bigger problem. 1428. Date: Tue, 31 Jul 90 12:38:54 PDT From: mendel (Mendel Rosenblum) Subject: problems in sun4 mach module Bug 1) In sun4.md/machCode.c, routine Mach_Init(): Several consistency checks are done and panic() is called if they fail. Unfortunately the code implementing panic() hasn't been initialized yet so the machine prints "Fatal Error: " and hangs with interrupts disabled. This code should be changed to use Mach_MonPrintf() which will work before the sys module is initialized. Bug 2) In mach.h where MachSignalStack is declared there is a comment that says "This must a multiple of double-words!!!!". This comment is enforced by the "Fatal Error: " message if it is not observed. 
Unfortunately, MachSignalStack contains two structures (Sig_Stack and Sig_Context) from outside the mach module that contain no comments or guarantees of size or alignment. Before Ken added four bytes to Sig_Stack, Sig_Stack was 12 bytes and Sig_Context was 420 which added together to be double word size. After Ken added four bytes to Sig_Stack, the sun4 says "Fatal Error: " and hangs. The code in machCode.c and machAsm.s should be changed not to depend on MachSignalStack being double-word size. In sig.h, Sig_Stack was changed from: /* * Structure that user sees on stack when a signal is taken. */ typedef struct { int sigNum; /* The number of this signal. */ int sigCode; /* The code of this signal. */ Sig_Context *contextPtr; /* Pointer to structure used to restore the * state before the signal. */ } Sig_Stack; to /* * Structure that user sees on stack when a signal is taken. */ typedef struct { int sigNum; /* The number of this signal. */ int sigCode; /* The code of this signal. */ Sig_Context *contextPtr; /* Pointer to structure used to restore the * state before the signal. */ int sigAddr; } Sig_Stack; Ken, I believe that the Gods in charge of coding/commenting standards are punishing you. 1429. Date: Wed, 01 Aug 90 12:08:37 PDT From: Fred Douglis <douglis> Subject: X0msgs and disk space i believe / was filling because arson's X server was running in an infinite loop and printing messages to /usr/adm/X0msgs. at least, when i tried to use arson a few minutes ago, its X system was wedged and had lots of recent usage, and it refused to die with ^Z or ^C after an F1-k. 1430. Date: Wed, 01 Aug 90 16:12:18 PDT From: Fred Douglis <douglis> Subject: ds3100 ipServer not politically correct bks just complained that sprite machines are getting redirect msgs from csgw very often. turns out the ds3100 gets a redirect on every packet to, say, ginger. sun4cs are fine. 1431. 
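A note on the MachSignalStack size rule from 1428: a check of this kind could in principle run when the kernel is compiled rather than at boot, before panic() is usable. The sketch below is only an illustration under stated assumptions. The Sketch_ types and helpers are hypothetical stand-ins, not the real Sprite declarations; they model the sun4 layout with 4-byte ints and pointers and treat Sig_Context as an opaque 420 bytes, the size reported in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DOUBLE_WORD 8    /* bytes in a sun4 double word */

/* Hypothetical model of Sig_Context: only its reported size (420) matters. */
typedef struct {
    uint8_t opaque[420];
} Sketch_SigContext;

/* Sig_Stack before the change: three 4-byte fields, 12 bytes in total. */
typedef struct {
    int32_t  sigNum;
    int32_t  sigCode;
    uint32_t contextPtr;        /* models a 4-byte sun4 pointer */
} Sketch_SigStackOld;

/* Sig_Stack after sigAddr was added: 16 bytes in total. */
typedef struct {
    int32_t  sigNum;
    int32_t  sigCode;
    uint32_t contextPtr;
    int32_t  sigAddr;
} Sketch_SigStackNew;

/* Bytes of padding needed to round size up to a double-word multiple. */
static size_t Sketch_PadToDoubleWord(size_t size)
{
    return (SKETCH_DOUBLE_WORD - size % SKETCH_DOUBLE_WORD) % SKETCH_DOUBLE_WORD;
}

/* Size after padding; the assert documents the postcondition. */
static size_t Sketch_PaddedSize(size_t size)
{
    size_t padded = size + Sketch_PadToDoubleWord(size);

    assert(padded % SKETCH_DOUBLE_WORD == 0);
    return padded;
}

/*
 * Compile-time check: the array size becomes -1, and the build fails,
 * when the condition is false. This works in pre-ANSI-era compilers too.
 */
#define SKETCH_COMPILE_CHECK(name, cond) typedef char name[(cond) ? 1 : -1]

/* The old layout (12 + 420 = 432 bytes) satisfies the rule. */
SKETCH_COMPILE_CHECK(Sketch_OldLayoutIsAligned,
    (sizeof(Sketch_SigStackOld) + sizeof(Sketch_SigContext))
        % SKETCH_DOUBLE_WORD == 0);

/* The new layout (16 + 420 = 436 bytes) satisfies it only with 4 pad bytes. */
SKETCH_COMPILE_CHECK(Sketch_NewLayoutNeedsFourPadBytes,
    (sizeof(Sketch_SigStackNew) + sizeof(Sketch_SigContext) + 4)
        % SKETCH_DOUBLE_WORD == 0);
```

Writing the second check without the "+ 4" would stop the build, which is the kind of early failure the boot-time "Fatal Error: " hang does not give. Whether padding or removing the size dependence is the better fix is the question the message itself raises.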
Date: Thu, 2 Aug 90 09:09:59 PDT From: ouster (John Ousterhout) Subject: Mint crash Mint was down when I came in this morning (about 8:30) and apparently had been down only a few minutes. The crash was caused by an Address Error at 0xe039c02. The message immediately before that on the console log was: Fscache_UpdateAttrFromClient 44: 'localtime' <0,1028> short size 813, not 65335 I don't know whether this had anything to do with the crash or not. I rebooted with "new". 1432. Date: Thu, 2 Aug 90 13:06:12 PDT From: ouster (John Ousterhout) Subject: 1.070 flakey? (whining) Is anyone else experiencing flakiness with the 1.070 kernels? Piracy (a DS3100) has crashed twice today already with TLB LD miss exceptions, after being up for a week or more before that. In addition, Mercenary (a sun4c) hung mysteriously this morning and would not respond to keystrokes, mouse motions, or network packets. I L1-A'ed it and continued it, and it mysteriously came back to life again. 1433. Date: Thu, 2 Aug 90 13:20:58 PDT From: shirriff (Ken Shirriff) Subject: Re: Mint crash (whining) >Mint was down ... Address Error at 0xe039c02. >The message immediately before that on the console log was: >Fscache_UpdateAttrFromClient 44: 'localtime' <0,1028> short size 813, not 65335 I looked at the code, and the address error was in Fsrmt_RpcRead, when the code jumps through the lookup table: streamOpTable[paramsPtr->fileID.type].clientVerify. It apparently got a bad address from the table. This presumably means that the fileID.type was bad. This isn't directly related to the UpdateAttr message (I believe), but I think we can conclude that mustard was responsible for both problems. I think this same crash occurred on the 24th when we were having all the problems with mustard before (Mary, do you have it written down?). Conclusions: 1. Mustard is apparently sending packets with 3 types of errors: a) bad flags ... kills mint with exec count = -1 b) bad attributes ... 
causes Fscache_UpdateAttrFromClient warning c) bad file type ... kills mint with Address Error 2. These errors occur in different positions in the packet. (I was hoping it would be the same bit wrong in each case, but no such luck.) 3. Either there's some obscure software bug causing these three types of errors, only on mustard, or there's a hardware bug causing bad packets to go out. My bet is that mustard has a flaky network interface. 1434. Date: Thu, 2 Aug 90 14:16:05 PDT From: ouster (John Ousterhout) Subject: Piracy dead again (whining) Piracy has dropped into the debugger once again, with the message "TLB LD miss exception at PC 0x800b99c8". Although I wouldn't swear to it, I think this is the same error it got in the two previous crashes today. I've left Piracy in the debugger in the hopes that a DS3100 guru would be willing to take a look at it. 1435. Date: Thu, 2 Aug 90 14:17:48 PDT From: shirriff (Ken Shirriff) Subject: printf is full of bugs (whining) There are loads of bugs in printf. Where did it come from and can we get a new version from there? Or should I try to fix it, or should I copy a version from monet? The original bug I found was that the ' ' flag isn't implemented. In tracking that down, I discovered other problems such as the conversions 'i' and 'p' are missing, 'X' isn't handled properly, and many combinations of flags don't work right. 1436. Date: Thu, 2 Aug 90 14:23:51 PDT From: ouster (John Ousterhout) Subject: Re: printf is full of bugs (whining) I'm afraid I'm responsible for printf, which means that a new version isn't likely to be forthcoming. I don't believe that you'll be able to copy a version from monet: the last time I checked the BSD versions were all in assembler code; in any case they might not mesh with the rest of our stdio library. I wrote printf from the BSD man page, and I thought I got everything on the manual page but perhaps I missed something. 
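The printf gaps listed in 1435 (the ' ' flag, the 'i' and 'p' conversions, 'X', flag combinations) can be pinned down with a few self-checking calls. The sketch below is an illustration, not part of the Sprite library: the Sketch_ names are hypothetical, and the expected strings follow the ANSI C definition of printf, where 'i' and 'p' are specified, rather than the BSD man page mentioned in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one int with the given conversion and compare the result
 * with the expected text. Returns 1 on a match, 0 otherwise.
 */
static int Sketch_FormatMatches(const char *expected, const char *format,
                                int value)
{
    char buf[64];

    assert(expected != NULL && format != NULL);
    snprintf(buf, sizeof buf, format, value);
    return strcmp(buf, expected) == 0;
}

/* Run the checks and return how many of them failed. */
static int Sketch_CountPrintfFailures(void)
{
    char buf[64];
    int failures = 0;

    /* ' ' flag: a blank in place of the sign for non-negative values. */
    failures += !Sketch_FormatMatches(" 42", "% d", 42);
    failures += !Sketch_FormatMatches("-42", "% d", -42);

    /* 'i' behaves like 'd' on output. */
    failures += !Sketch_FormatMatches("42", "%i", 42);

    /* 'X' prints upper-case hexadecimal digits. */
    failures += !Sketch_FormatMatches("FF", "%X", 255);

    /* A flag combination: explicit sign plus zero padding to width 5. */
    failures += !Sketch_FormatMatches("+0042", "%+05d", 42);

    /* 'p' output is implementation-defined; only require some output. */
    failures += (snprintf(buf, sizeof buf, "%p", (void *) buf) <= 0);

    return failures;
}
```

A table of this shape is easy to extend as more broken combinations turn up, and it can be run against any candidate replacement before it is installed.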
For example, I don't see "i" and "p" options in my manual page; could these be later POSIX additions? Of course, it would be nice to have all that stuff in the Sprite printf. 1437. Date: Thu, 02 Aug 90 15:51:36 PDT From: Fred Douglis <douglis> Subject: Re: Piracy dead again piracy died at a point where it shouldn't have (as usual): Segmentation fault [.block534:676 +0x4,0x800b99c8] sched_Instrument.processor[cpu].idleTicksLow++; main: 108 Main_InitVars(); (kdbx) run Segmentation fault [.block534:676 +0x4,0x800b99c8] sched_Instrument.processor[cpu].idleTicksLow++; cpu was 0, sched_Instrument was fine, etc. 1438. Date: Thu, 2 Aug 90 18:12:00 PDT From: mendel (Mendel Rosenblum) Subject: migration considered harmful I evicted a process from jaywalk back to treason and it got a seg fault on treason. The process was /sprite/src/lib/gcc/sun4.md/as.sun3. It looks like the global registers got overwritten with the string: "vmMach.h" 1\n \n\n\n". 1439. Date: Thu, 2 Aug 90 18:14:30 PDT From: mgbaker (Mary Gray Baker) Subject: joyride crash (whining) Joyride crashed with the familiar problem of trying to schedule a call-back element that is still on the list (so its pointers are NIL and are traversed while trying to insert it again). The element is the call-back for the RpcDaemonWakeup. I don't see why it happened. 1440. Date: Thu, 2 Aug 90 18:46:41 PDT From: mendel (Mendel Rosenblum) Subject: Migration considered harmful for global registers Sun4c's do not transfer the global registers, registers %g0 thru %g7, correctly during migration. The program in ~mendel/gtrash demos this. Fortunately, the compilers we use aren't smart enough to use the global registers for anything but very short-term temporary values. 1441. Date: Fri, 3 Aug 90 10:32:48 PDT From: mendel (Mendel Rosenblum) Subject: rdate during booting doesn't work The (new?) /boot/bootcmds does a (rdate %timeServer &) well before it starts the ipServer. 
This doesn't always work because the connect() system call in rdate gets an error if the ipServer has not set up the tcp pdev. 1442. Date: Fri, 3 Aug 90 10:42:29 PDT From: mendel (Mendel Rosenblum) Subject: sun4c file cache limit was removed Before last night, sun4c's such as jaywalk executed a fscmd during boot that limited the size of the file cache. Without this limit, the file cache will grow until the machine explodes. Does anybody (Mary?) remember where and what the command to limit this was? Until this is added back, the sun4c's with more than 16 Megabytes of memory will be at risk. 1443. Date: Tue, 07 Aug 90 12:08:53 PDT From: Mike Kupfer <kupfer> Subject: sage crash Sage died last night, apparently around 0030 (12:30 am). There was a complaint about a kernel page fault at pc 0xf609fe44, addr=0xf8303ae4. Ken helped me look at it. We weren't able to debug it; sage didn't respond to the net or the keyboard. There were some console messages partially scribbled over the X desktop. Ken said they were a sign of a full disk. 1444. Date: Wed, 8 Aug 90 11:14:37 PDT From: mendel (Mendel Rosenblum) Subject: newly installed sun4 loader is broken The sun4 loader /sprite/cmds.sun4/ld that was installed Fri (Aug6) doesn't work. It incorrectly relocates routines when invoked with the "-r" option as done by our kernel builds. I copied the old object file from /sprite/cmds.sun4.old/ld and it appears to work. Any linking done on sun4 between Friday and now should be redone. Please test important software before installing it!!!!!!! (bitching) 1445. Date: Wed, 08 Aug 90 11:29:23 PDT From: Fred Douglis <douglis> Subject: bug fix: memory leak found the change i made many weeks ago to fix scavenging of pipe handles had an unintended side effect. i was inadvertently passing an extra argument in the middle of an argument list, so things were shifted and pipes were never getting released. 
lint would have told me about this, which confuses me since i thought i ran lint in my copy of fsio before installing my stuff. but lint hadn't been run in fsio itself since last fall. lint gets harder and harder to run as more unimportant messages are generated; it's been a long time since we went on a de-linting campaign, but perhaps it's due. (not that i exactly want to rush out and do this, of course :) i've made the fix but it hasn't been installed yet. we need to figure out when to push out a new kernel with this fix and with mary's /dev/fb changes. 1446. Date: Wed, 08 Aug 90 11:50:15 PDT From: Fred Douglis <douglis> Subject: tftpd vanishing allspice's tftpd has disappeared into thin air twice in the past two days, preventing clients from booting. 1447. Date: Wed, 8 Aug 90 13:49:24 PDT From: shirriff (Ken Shirriff) Subject: ds3100 rpc hung to allspice When I boot my ds3100 with ds3100 or new it hangs in the boot with: Importing "/" from host #14 <open> RPC timed-out open of "/" waiting for recovery But allspice is fine, so I don't see why it's hanging. Is there some daemon I should restart? 1448. Date: Wed, 08 Aug 90 15:23:35 PDT From: Mike Kupfer <kupfer> Subject: malloc dies when should return null ptr If there is insufficient memory for malloc to satisfy a request, it should return a null pointer. Instead it prints an error message and dies (see MemChunkAlloc). 1449. Date: Wed, 8 Aug 90 15:26:59 PDT From: ouster (John Ousterhout) Subject: Re: malloc dies when should return null ptr This is arguably a bug, since it doesn't do what the official UNIX man page says, but it's intentional: the motivation is that there is unlikely to be anything you can do when you run out of memory (particularly in a VM system where there's a lot of memory); rather than returning NULL and forcing a zillion malloc callers to panic individually, Sprite just panics automatically. 1450. 
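The two malloc behaviors discussed in 1448 and 1449 (return a null pointer, or print a message and die inside the allocator) can be put side by side in a small sketch. This is an illustration only, not the Sprite implementation: Sketch_Alloc, Sketch_Policy, and Sketch_AlwaysFail are hypothetical names, and the underlying allocator is passed in as a parameter so that memory exhaustion can be simulated.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Signature shared by malloc and by the simulated failing allocator. */
typedef void *(*Sketch_RawAlloc)(size_t numBytes);

typedef enum {
    SKETCH_RETURN_NULL,     /* UNIX man page behavior: caller must check */
    SKETCH_DIE              /* behavior described in 1449: die on failure */
} Sketch_Policy;

/* Stand-in allocator that simulates running out of memory. */
static void *Sketch_AlwaysFail(size_t numBytes)
{
    (void) numBytes;
    return NULL;
}

/*
 * Allocate numBytes through rawAlloc. Under SKETCH_DIE a failure is
 * reported once, here, and the program aborts; under SKETCH_RETURN_NULL
 * the null pointer is handed back and every caller has to test for it.
 */
static void *Sketch_Alloc(size_t numBytes, Sketch_Policy policy,
                          Sketch_RawAlloc rawAlloc)
{
    void *ptr;

    assert(rawAlloc != NULL);
    ptr = rawAlloc(numBytes);
    if (ptr == NULL && policy == SKETCH_DIE) {
        fprintf(stderr, "Sketch_Alloc: out of memory (%lu bytes)\n",
                (unsigned long) numBytes);
        abort();
    }
    return ptr;
}
```

For example, Sketch_Alloc(16, SKETCH_RETURN_NULL, Sketch_AlwaysFail) yields a null pointer, while the same call with SKETCH_DIE never returns. Making the policy a parameter is one way both kinds of caller could share a single allocator.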
Date: Thu, 9 Aug 90 10:38:05 PDT From: mendel (Mendel Rosenblum) Subject: Memory leak from net module The Net_InstallRoute system call leaks memory badly. Each time it is called it appears to lose 32 to 70 bytes. This was acceptable when it was called once for each host at boot time. Now that netroute is running once an hour from cron, Net_InstallRoute gets called for each sprite host. With 72 hosts this adds up to around 2.4 kilobyte/hour. That is 57.6 kilobytes per day and 403.2 per week. If a sprite machine stayed up a year.... John H, is this fixed in your rewrite of the net module? 1451. Date: Thu, 09 Aug 90 14:12:08 PDT From: Mike Kupfer <kupfer> Subject: Re: malloc dies when should return null ptr Well, either the code or the (Sprite) man page should be fixed. I would prefer that the code be fixed. I would be happy if we used the scheme Fred mentioned, where programs that want a completely UNIX-compatible interface can get it. Do we already have such a library? Perhaps there should be a separate "incompatibilities" section in man pages for UNIX routines? 1452. Date: Fri, 10 Aug 90 10:45:21 PDT From: tve (Thorsten von Eicken) Subject: can't boot sun3/60 I tried "b le(0,961b,43)sun3.new" as in jhh's mail, but it dies with an address error in PC fetch at b40ea (just after printing "SpriteBoot: ..."). Same with trying "b le(...)sun3". 1453. Date: Fri, 10 Aug 90 12:12:25 PDT From: Fred Douglis <douglis> Subject: permissions for symm.md; mkmf bug there are symm.md directories strewn throughout the source tree, without group write permission for sprite, owned by fubar. this causes mkmf to fail. however, it doesn't abort with an error, it prints: "mkmf.tmp.sed: permission denied" and continues to run pmake. 1454. Date: Sat, 11 Aug 90 12:38:27 PDT From: shirriff (Ken Shirriff) Subject: Mail file trashed Between 12:11 and 12:34, my mail file got messed up. I got 1601 bytes of nulls on the end. 
This is the problem I encountered when the sending machine crashed before the data got flushed to the server. I had hoped that the fflush's would fix this, but apparently not. On the good side of things, my modifications to mail warned me that this happened, instead of silently mashing messages together as mail used to. 1455. Date: Sat, 11 Aug 90 17:38:26 PDT From: shirriff (Ken Shirriff) Subject: Strong evidence of mustard trashing packets From my close trace information on mustard when it crashed allspice I got: /sprite/cmds/sed flags 1005 (FS_READ|FS_EXECUTE) /etc/zoneinfo/localtime flags 9001 (FS_READ) /sprite/admin/migd/pdev flags 9003 (FS_READ|FS_WRITE) Allspice complained about a bad close on localtime from mustard, which can only happen if FS_EXECUTE is set. Unfortunately I couldn't find out the entire packet that allspice received, but it's clear that the flags mustard was sending didn't have FS_EXECUTE set, but the flags allspice received did have FS_EXECUTE set. This shows the problem isn't a filesystem bug, but is a problem somewhere between the rpc on mustard and the rpc on allspice. I've been running a packet echo program on mustard for about a week, but it's failed to catch any bad packets. 1456. Date: Sun, 12 Aug 90 13:08:16 PDT From: mendel (Mendel Rosenblum) Subject: Re: Mail file trashed My mail also got 1601 bytes of nulls, but not at the very end. I got a couple more messages after the nulls. 1457. Date: Sun, 12 Aug 90 13:16:14 PDT From: Fred Douglis <douglis> Subject: tx and vi on sunos 4.1 don't mix rlogin to ginger from sprite using tx, and you can't run vi in visual mode. it apparently refuses to pay attention to the %TERMCAP. can we get a version of vi that will? 1458. Date: Sun, 12 Aug 90 17:28:47 PDT From: shirriff (Ken Shirriff) Subject: csh problem (whining) Sometimes when I enter a command before the previous one is done, I get a directory listing for no apparent reason. Does this happen to anyone else? 1459. 
Date: Sun, 12 Aug 90 17:41:49 PDT From: shirriff (Ken Shirriff) Subject: Re: tx and vi on sunos 4.1 don't mix You can use the tx vt100 emulation mode to do editing on ginger until ginger's vi is fixed to use %TERMCAP. (Ginger's vi uses different vt100 escapes from other machines, for no apparent reason, and I didn't emulate them all before. I've installed a new tx that should happily emulate the vt100 commands ginger uses. Let me know if there are any problems.) 1460. Date: Mon, 13 Aug 90 09:06:37 PDT From: ouster (John Ousterhout) Subject: Allspice crash When I came in this morning Allspice was down with a Level 15 Interrupt. 1461. Date: Mon, 13 Aug 90 09:13:26 PDT From: bmiller (Bob Miller) Subject: 'adduser' 'adduser' is not working for me anymore. responds with... "Could't fetch entry from cad Make sure your machine is listed in /.rhosts" my machine, subversion, is in /.rhosts 1462. Date: Mon, 13 Aug 90 10:27:13 PDT From: rab (Robert A. Bruce) Subject: Re: 'adduser' Mint is the only sprite machine that is authorized to access the uid database on cad. All other machines try to redirect their accesses via mint. Since mint is turned off, `adduser' doesn't work. I will ask Brian to set everything up so allspice can access the database instead of mint. 1463. Date: Mon, 13 Aug 90 17:58:29 PDT From: Mike Kupfer <kupfer> Subject: "tar t" lists contents on stderr "tar tf mumble.tar | wc" displays the table of contents, followed by three 0's. "tar tf mumble.tar |& wc" just gives the wc counts. Workaround: use tar.gnu, or live with it (if you're sure there won't be any errors). 1464. Date: Tue, 14 Aug 90 14:25:53 PDT From: mgbaker (Mary Gray Baker) Subject: vt100 mode in tx I tried going into vt100 mode in tx, but vi still said it couldn't find the termcap. 
I tried it again and I got the following message: Mx_ReplaceBytes: bad range: first = (0,0), last = (24,0), num = 1 Also, in my syslog, I got this message: MachPageFault: Bus error in user proc 13524, PC = 4facc, addr = 0 BR Reg 80 The window was dead. 1465. Date: Tue, 14 Aug 90 15:41:17 PDT From: culler (David Culler) Subject: x windows locking up X-windows locked up on cardamom until I removed xrdb from my .xinitrc 1466. Date: Tue, 14 Aug 90 15:56:26 PDT From: Fred Douglis <douglis> Subject: migd race condition hit ds3100 looks like the same race condition that hit allspice & anise as a cache writeback error may also have hit a ds3100. piquante died with migd running, on a tlb fault for a segment with no swap file. 1467. Date: Wed, 15 Aug 90 09:06:23 PDT From: ouster (John Ousterhout) Subject: TCP problem on Sprite? >From karels@okeeffe.Berkeley.EDU Tue Aug 14 20:43:54 1990 From: karels@okeeffe.Berkeley.EDU (Mike Karels) To: eric@okeeffe.Berkeley.EDU, ouster@sprite.Berkeley.EDU Cc: sklower@okeeffe.Berkeley.EDU, van@okeeffe.Berkeley.EDU Subject: network problems between mammoth and paprika Date: Tue, 14 Aug 90 20:34:18 PDT This morning I noticed that network performance was again quite bad on our local nets, mostly due to overload on our gateway. As in the last few times I've noticed this, I found mammoth in a shouting match with another system, sending tiny TCP packets as fast as it could. This time the destination was not a PC/RT as usual, but was a Sprite system, paprika. As far as I can tell, this problem occurs only when there is a problem in the TCPs on both ends, although this may not be true for the client end (mammoth in this case; the X library requests that TCP not aggregate small packets). The problem in each case is that emacs was running on mammoth, using a window on an X server on the other machine. The window is closed ungracefully. On a PC/RT running AIX, the usual culprit is xdestroy (bound to a Cancel button). 
In Sprite's case, the problem appears to be a crash. The user "schauser" uses lisp on mammoth, which he runs in an emacs window under X on Sprite; I don't know if this is on paprika or one of its client machines. He said that his workstation crashes, then emacs on mammoth runs away. This has apparently happened several times, and after debugging a previous episode, Keith Sklower asked schauser to find the emacs and kill it after crashes. This didn't happen this morning, perhaps because he wasn't aware of a crash. After an hour or so of network lossage, I started debugging, and with Keith I found the same problem happening. Mammoth was sending 4-byte TCP segments to paprika as fast as it could, and paprika was periodically acknowledging them and moving the window forward. (Our 500-packet sample took less than a second to accumulate.) If the machine running the X display had crashed, however, the TCP connection should have been killed, and paprika should have reset the connection. On the other hand, if the connection was still alive on paprika but data couldn't be delivered to the X server (on another machine?), the TCP window should have gone closed. In any case, the data isn't really going anywhere, and the situation doesn't resolve itself without manual intervention. My trace begins with 228 4-byte segments from mammoth, followed by an ack from 4092 bytes back from paprika, then alternating 4-byte segments from mammoth and 4 bytes worth of ack from paprika. This is serious brain-damage in all ways: the connection shouldn't be alive, mammoth shouldn't send 4 bytes at a time with outstanding data, and paprika shouldn't acknowledge 4 bytes at a time with a data stream like this. The algorithms required to fix this are not only documented, but they are now required in a conforming TCP implementation. Although both ends are misbehaving here, fixing either end would have solved today's problem. 
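The algorithms referred to above are presumably the small-packet (Nagle) rule for senders and silly-window avoidance for receivers, both collected in the Host Requirements RFC (RFC 1122). The sketch below shows the two decisions in isolation, as a minimal illustration and not as Sprite's or BSD's TCP code. All Sketch_ names are hypothetical, and the 4096-byte receive buffer and 1460-byte maximum segment size used in the example are assumptions, not values from the trace.

```c
#include <assert.h>

static unsigned Sketch_Min(unsigned a, unsigned b)
{
    return a < b ? a : b;
}

/*
 * Receiver side (silly-window avoidance): the advertised window may
 * grow only when the gain is at least min(half the buffer, one MSS).
 *   bufSize    total receive buffer
 *   bytesHeld  data received but not yet read by the application
 *   advertised window currently offered to the peer
 * Returns 1 if a larger window may be advertised now, 0 otherwise.
 */
static int Sketch_MayAdvanceWindow(unsigned bufSize, unsigned bytesHeld,
                                   unsigned advertised, unsigned mss)
{
    unsigned freeSpace, gain, threshold;

    assert(bytesHeld <= bufSize);
    freeSpace = bufSize - bytesHeld;
    if (advertised >= freeSpace) {
        return 0;               /* nothing new to offer */
    }
    gain = freeSpace - advertised;
    threshold = Sketch_Min(bufSize / 2, mss);
    return gain >= threshold;
}

/*
 * Sender side (Nagle): while earlier data is unacknowledged, hold
 * small writes until a full segment is queued or the ack arrives.
 * Returns 1 if the queued data may be sent now, 0 otherwise.
 */
static int Sketch_MaySendNow(unsigned queued, unsigned unacked, unsigned mss)
{
    if (queued == 0) {
        return 0;               /* nothing to send */
    }
    if (queued >= mss) {
        return 1;               /* a full-sized segment is ready */
    }
    return unacked == 0;        /* small data only on an idle connection */
}
```

With these rules a receiver whose buffer has only 4 bytes free keeps its window closed instead of offering 4 more bytes, and a sender with data outstanding does not emit a 4-byte segment. The message notes that the X library asks TCP not to aggregate small packets, so on mammoth the sender rule would be switched off at the application's request; the receiver rule is the one that would apply on paprika.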
This problem has been happening fairly often over some period of months (although I've only been aware of Sprite's involvement since this morning). The network is a shared resource that needs to be protected from such misbehavior. I'd like to see that both ends get fixed. Eric, I think that either someone must fix emacs (or the X library) on mammoth, or that emacs must be removed from mammoth. I don't know if this problem is specific to mammoth, but only mammoth has caused problems of this sort, and for multiple users using different X servers and host operating systems. I talked to Craig about this briefly; I think he would be willing to take this on. John, is someone currently maintaining TCP for Sprite who can fix this? I'm willing to explain the problem as needed. 1468. Date: Wed, 15 Aug 90 11:03:37 PDT From: Fred Douglis <douglis> Subject: portmap in loop portmap was pegging allspice's cpu again. i tried to debug it but it was in the sunrpc library and didn't have debugging info. i installed a new portmap, linked with libsunrpc_g.a, and started it on allspice. next time it goes haywire it would be good to debug it for real. 1469. Date: Wed, 15 Aug 90 12:00:53 PDT From: Fred Douglis <douglis> Subject: bootp stale handle ds3100s were unable to boot for a while because allspice's bootp was printing "recvfrom failed: stale remote file handle". does bootp use portmap? perhaps bootp needs to recontact portmap if portmap is reinstantiated. 1470. Date: Wed, 15 Aug 90 16:31:05 PDT From: rab (Robert A. Bruce) Subject: writable kernel sources A lot of the kernel sources have write permissions set, even though they are not checked out. The dev module is especially bad. Most of the source files in dev have permissions set to 0666. 1471. Date: Wed, 15 Aug 90 18:24:42 PDT From: mendel@rosemary.Berkeley.EDU (Mendel Rosenblum) Subject: allspice crashed Allspice hung up with an idle Proc_ServerProc having a lock on a file in /tmp. 
The file was being deleted and the delete was hung trying to reacquire the lock after the consistency callbacks. 1472. Date: Wed, 15 Aug 90 18:26:46 PDT From: mendel@rosemary.Berkeley.EDU (Mendel Rosenblum) Subject: booting allspice from ginger doesn't work. Allspice doesn't boot off ginger. It hangs trying to broadcast for its inet address using rarp. Sounds like ginger needs to be configured to answer RARP requests for allspice. 1473. Date: Wed, 15 Aug 90 23:42:46 PDT From: Fred Douglis <douglis> Subject: allspice hung again same as mendel's message before, as well as many, many deadlocks in the past. /tmp was locked by a process doing a consistency callback. the callback unlocks the hdr for the file (ctm-something-or-other) and grabs the monitor. it used to be that it would try relocking the file under the monitor. now it releases the monitor and relocks the file. this doesn't help, though, because ctm has been locked by a Proc_ServerProc and the process that has locked tmp blocks indefinitely. when i rebooted allspice, it took a "longer time than usual" to reboot. no indication of recovery after 25 minutes. i finally got worried and went back upstairs, and i found a message about no rpc servers, lots of processes in the ready state, and nothing apparently going on. i impulsively aborted and rebooted, thinking that it was the same problem i'd just spent a long time debugging, and this time i'd watch to see what was happening. wrong-o. it hit several "level 15 interrupts" and had to be powered off and on. in the meantime i realized it had actually started recovering while i was en route upstairs. oops. 1474. Date: Thu, 16 Aug 90 02:50:19 PDT From: rab (Robert A. Bruce) Subject: lots of lint warnings (whining) There were a lot of lint warnings while installing the kernel. The symm stuff is the worst because it has a lot of politically incorrect increment/decrement side effects and missing braces. But the other modules are pretty bad too. 1475.
Date: Thu, 16 Aug 90 11:38:05 PDT From: bmiller (Bob Miller) Subject: having problems printing Our printer, lw533, doesn't seem to want to print today. Prof. Culler sent something to print, but it just sat there in the queue (it showed 'active'). SHALLOT, which drives the printer, is working and Terry had printed something earlier this morning. Right now, I've got a couple of test print jobs in the queue (one shows 'active'), but nothing is happening. Can you check into this? Thanks. 1476. Date: Thu, 16 Aug 90 12:55:23 PDT From: rab (Robert A. Bruce) Subject: king A lot of stuff on king is outdated. For instance it still uses the old /etc/passwd format, which makes it difficult to keep consistent with Evans. 1477. Date: Thu, 16 Aug 90 15:48:21 PDT From: Fred Douglis <douglis> Subject: pipes still left around the fix i made did not cause pipes to get scavenged as expected. mendel and i had been speculating at one point about ken's change to fsPipe.c to deal with locking: rcsdiff -r9.{4,5} fsPipe.c RCS file: RCS/fsPipe.c,v retrieving revision 9.4 retrieving revision 9.5 diff -r9.4 -r9.5 13c13 < static char rcsid[] = "%Header: /sprite/src/kernel/fsio/RCS/fsPipe.c,v 9.4 90/06/27 11:16:50 douglis Exp % SPRITE (Berkeley)"; --- > static char rcsid[] = "%Header: /sprite/src/kernel/fsio/RCS/fsPipe.c,v 9.5 90/07/15 13:36:32 shirriff Exp % SPRITE (Berkeley)"; 347c347 < Fsutil_HandleRelease(handlePtr, TRUE); --- > Fsutil_HandleRelease(handlePtr, FALSE); i believe this may be causing the reference not to be removed. in any case, it appears that each time i create a pipe, the number of handles in limbo goes up by one. this wasn't the case a while back. 1478. Date: Thu, 16 Aug 90 16:16:14 PDT From: ouster (John Ousterhout) Subject: mach/sun3.md/cvtStat.o There is a file "cvtStat.o" in the sun3.md subdirectory of the mach kernel module. I couldn't find the source file for this object file. Does anyone know of a reason why it exists? 1479.
Date: Thu, 16 Aug 90 16:22:53 PDT From: Mike Kupfer <kupfer> Subject: cruft in X11/cmds I made the mistake :-) of rebuilding all the X clients, so that they could take advantage of the readv fix. It looks like folks have been adding things to /X11/R4/src/cmds and then installing them in a haphazard fashion. Problem #1: Unless someone can free up some space in /X11, I don't think there's going to be enough room for all the games and whatnot that people have pulled off the net and installed. Problem #2: Some of the stuff doesn't build, at least on a Sparcstation. Known offenders are xditview, xdm, xcolors, and xpic (and I'm not even done rebuilding everything for the sun4; I haven't even started on the sun3 or ds3100). So, could someone who's more familiar with the contents of X11 than I am please see if there are any old files that can be nuked? If I run out of space, can I just start removing games, starting with what's in *.md/old? Also, what's the policy about stuff that doesn't build. Do we hide it away or leave it in with the stuff that does build? 1480. Date: Thu, 16 Aug 90 16:24:11 PDT From: Fred Douglis <douglis> Subject: disappearing dependencies.mk i happened to notice that sig didn't get installed in the last kernel install, and found out this was partly because sig/dependencies.mk didn't exist. why pmake wouldn't complain about the missing file is beyond me -- maybe there's another empty dependencies.mk in pmake's path someplace -- but it's worrisome. i checked around and found out that mach/sun4c.md/dependencies.mk also didn't exist. where are these files going? i am going to add a "make dependall" step to the howto file for building a new kernel. 1481. Date: Thu, 16 Aug 90 16:59:54 PDT From: Mike Kupfer <kupfer> Subject: "update" should be more careful It would be nice if "update" were a bit more careful when it installs something. 
I managed to lose sun4.cmds/xterm briefly because "update" deleted or moved the old version, then it found that it didn't have enough room to install the new one. 1482. Date: Fri, 17 Aug 90 22:24:05 PDT From: Mike Kupfer <kupfer> Subject: migration lost a process I was building X clients on piracy (a ds3100). [...] --- install --- /sprite/cmds.ds3100/update -m 775 -s -b /X11/R4/cmds.ds3100.old ds3100.md/makelev /X11/R4/cmds.ds3100/makelev Updating: /X11/R4/cmds.ds3100/makelev Error in Proc_Migrate: the process ID is not in the proper range or the process doesn't exist make: 1 error *** Error code 2 make: 1 error (makelev is part of the golddig game.) Rerunning "make" failed to reproduce the problem (what a surprise :-)). 1483. Date: Sun, 19 Aug 90 16:15:21 PDT From: tve (Thorsten von Eicken) Subject: portmap going crazy on allspice root 20e43 51.1 0.1 152 152 READY1128:41 /sprite/daemons/portmap I killed & restarted it. I also had to start a tftpd to be able to boot a client. 1484. Date: Sun, 19 Aug 90 16:48:15 PDT From: ouster (John Ousterhout) Subject: Fsio_StreamAddClient unlocks without locking In converting to the new synchronization code I discovered that the procedure Fsio_StreamAddClient executes UNLOCK_MONITOR without ever executing LOCK_MONITOR (the new synchronization code panics if you free something that's unowned). Can anyone think of a reason why the code should be the way it is? Does anyone know whether the error is a missing LOCK_MONITOR or a superfluous UNLOCK_MONITOR? For now I'm adding a LOCK_MONITOR call to my version of the file. 1485. Date: Sun, 19 Aug 90 18:57:12 PDT From: rab (Robert A. Bruce) Subject: dumps The nightly dumps did not complete over the weekend because of write errors on the tape drive. If you have any critical files, you should make redundant copies to avoid losing them. I will send mail as soon as the problem is fixed and the dumps are up to date. 1486.
Date: Mon, 20 Aug 90 15:25:39 PDT From: culler (David Culler) Subject: For your records My DS3100 crashed with the following: fatal error VmPageServerRead: trying to read from non-existent swap file. version 1.070 (ds3100) (Aug 1 1990 13:18:21) PC0x800c256c. 1487. Date: Tue, 21 Aug 90 12:16:41 PDT From: pmchen (Peter M. Chen) Subject: oregano in pain It seems that oregano is in pain with lots of: <prefix> 8/21/90 12:15:21 oregano (38) RPC timed-out <prefix> 8/21/90 12:15:26 oregano (38) RPC timed-out <prefix> 8/21/90 12:15:32 oregano (38) RPC timed-out 8/21/90 12:15:32 oregano (38) - recovering handles 8/21/90 12:15:32 oregano (38) Recovery complete 8 handles reopened messages. This has been taking place for a while. 1488. Date: Tue, 21 Aug 90 12:17:01 PDT From: shirriff (Ken Shirriff) Subject: More mustard problems I booted mustard and got a bunch of: LE ethernet: Bogus receive interrupt. Buffer owned by chip.
LE ethernet: Reinitialized chip. Rpc_Dispatch: bad channel 651231233 from clt 44 rpc10Resetting network interface It did this a bunch of times and then died. a) does this mean the network interface is dead and can be replaced? b) can I swap mustard for violence or arson or something to use? 1489. Date: Tue, 21 Aug 90 12:27:44 PDT From: rab (Robert A. Bruce) Subject: Re: oregano in pain According to oregano's console, it is timing out and recovering with forgery and garlic every 15 seconds or so. 1490. Date: Tue, 21 Aug 90 14:07:47 PDT From: rab (Robert A. Bruce) Subject: old sources (whining) In our source tree we have many directories that contain old sources. They are usually in directories with the suffix `old', such as `as.old', `pmake.old', etc. When pmake is run in a top level directory such as /sprite/src/cmds, it descends into these old directories and tries to make them. This often doesn't work because the old stuff isn't maintained and uses out of date header files, etc. This results in lots of distracting error messages and makes pmake abort. I propose that we either modify mkmf to ignore directories with a .old suffix, or even better, keep old sources in a separate hierarchy. For instance we could have /sprite/src/old/attcmds /sprite/src/old/cmds /sprite/src/old/kernel etc. 1491. Date: Tue, 21 Aug 90 14:42:55 PDT From: shirriff (Ken Shirriff) Subject: xwindows, twm, or tx bug (whining) When I start up my window system, I don't have a cursor. Everything I type goes in the right window, but the cursor is invisible. If I move the mouse out and back in, the cursor becomes visible. 1492. Date: Tue, 21 Aug 90 17:18:41 PDT From: Fred Douglis <douglis> Subject: strange "bad user tlb fault" message on ds3100 both piquante & kvetching today have printed Bad user TLB fault in process : pc= addr= rather than something like Bad user TLB fault in process e0251: pc=4001ec addr=0 1493.
Date: Tue, 21 Aug 90 16:12:20 PDT From: douglis@ginger.Berkeley.EDU (Fred Douglis) Subject: booting from ginger doesn't work twice in a row it hit "bus error" exceptions right after starting execution. also, the howto file mentions sun4.new instead of sun4.md/new in one place. 1494. Date: Wed, 22 Aug 1990 11:54:34 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: Re: Memory leak from net module The new net module cleans up when a new route is installed. 1495. Date: Wed, 22 Aug 90 12:01:37 PDT From: bmiller (Bob Miller) Subject: print problem an 'lpq' on my machine shows 'no space on remote; waiting for queue to drain'. SHALLOT (the driver for printer lw533) shows 'no entries'. Is this a Sprite problem? 1496. Date: Wed, 22 Aug 1990 12:12:15 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: Re: king One of the things on my list is to update cory. Right now they are running the 1.059 kernel from Feb 20. They sure don't make kernels like they used to! The update is likely to be a major one, since they are lacking the new passwd, boot sequence and X11R4, among other things. 1497. Date: Wed, 22 Aug 90 15:01:56 PDT From: pmchen (Peter M. Chen) Subject: oregano in pain I deleted my sole link to /spur2 (which was obsolete anyway), which didn't fix anything. I then rebooted, which fixed things. 1498. Date: Wed, 22 Aug 90 15:47:55 PDT From: ouster (John Ousterhout) Subject: Boing goofy for migration? It seems that every time I migrate something to boing during a pmake the thing never finishes. I finally migcmd-ed Boing not to import anything ever, and now my pmakes are all finishing fast. 1499. Date: Wed, 22 Aug 90 16:40:38 PDT From: ouster (John Ousterhout) Subject: Allspice hang Allspice hung up this afternoon ("RpcDoCall: <open> RPC to allspice is hung"). I went upstairs and reset its network interface (Break-N) and it recovered OK. 1500. 
Date: Wed, 22 Aug 90 19:59:56 PDT From: Fred Douglis <douglis> Subject: pipe limbo bug fixed i think i've fixed the problem with pipes being left in the limbo state. (i wanted to fix it in time to get a clean kernel that won't be slowed down scavenging unusable handles :) FsPipeGetIOAttr was grabbing the handle but unlocking it rather than releasing it. other GetIOAttr routines do a full release. the real question is why we only recently started noticing lots of handles accumulating, considering it seems this bug has been around forever. or have handles in the "limbo" state been accumulating steadily all along? 1501. Date: Wed, 22 Aug 90 22:53:45 PDT From: Mike Kupfer <kupfer> Subject: allspice temporarily stuck Allspice got very very busy. Break-N seemed to have some effect, but it didn't actually fix the problem. I tried an "rpcstat -srvr", which took a very long time to respond. Shortly after it did respond, the logjam cleared, and things went back to normal. Unfortunately, I managed to fill allspice's console before realizing that the problem had gone away, so I can't tell you what messages were displayed around the time of the rpcstat output. 1502. Date: Thu, 23 Aug 1990 10:50:49 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: unix compatibility bug(s) > From culler Thu Aug 23 09:41:21 1990 > Received: by sprite.Berkeley.EDU (5.59/1.29) > id AA272184; Thu, 23 Aug 90 09:41:21 PDT > Date: Thu, 23 Aug 90 09:41:21 PDT > From: culler (David Culler) > Message-Id: <9008231641.AA272184@sprite.Berkeley.EDU> > To: jhh > Subject: Ultrix Compatibility > > I've been trying to run idraw, compiled for Ultrix. I get a family of > strange behaviors, depending on where I run it from and when. In some > cases it tries to find servers for all of root. It gets screwed up about > the pathname of the file I want to open. If I get around these two, it > blows up with X-errors. Runs fine on dill. Runs o.k. for johnw. > Perhaps we could talk about it. > 1503. 
Date: Thu, 23 Aug 90 12:17:16 PDT From: rab (Robert A. Bruce) Subject: dumps The daily dump did not complete last night. /mic, /tic, /sprite/src/kernel and /scratch3 completed, but all other filesystems failed. The last successful dump was Wednesday morning. If you have modified any important files since then, you should make redundant copies. I will send mail as soon as the dumps are up to date. 1504. Date: Thu, 23 Aug 90 16:20:17 PDT From: Mike Kupfer <kupfer> Subject: inadequate instrumentation (whining) The recent problems with allspice have demonstrated that our system instrumentation tools are incomplete. "rawstat" has some of what I'm looking for, but it has the following problems: (1) it doesn't give CPU usage (broken down into user/system/idle) (2) it doesn't list interrupts (3) it can't give a running tally (a la "vmstat -t"). 1505. Date: Fri, 24 Aug 90 00:09:21 PDT From: douglis (Fred Douglis) Subject: tftpd bug patched when i tried rebooting machines, i lost many of them because tftpd kept disappearing on me. i finally had to run tftpd with some debugging info enabled. it became apparent that tftpd was hitting an error on a recvfrom and exiting. if it tried to restart by reopening the socket to accept requests, it would run into trouble if any of its children were still around and using the socket. so, i put in some patches to have it keep track of children and only restart when there are none around. this seems inordinately kludgy, and if anyone with an understanding of tftpd and/or the socket/bind/... routines wants to fix it, please do! 1506. Date: Fri, 24 Aug 90 02:35:45 PDT From: root (The Sprite God) Subject: removing file doesn't help disk space crunch a bug in a program to generate a 1MB file caused it to write 65MB and fill up /tmp. killing the process and removing the file didn't fix the problem -- instead, somehow a reference was left around and i had to reboot to get a reference to the file in lost+found, so i could truncate it by hand. 
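For comparison with the report above, here is a minimal POSIX C sketch of the Unix behavior it expects: removing a name does not free the blocks while some process still holds the file open, but truncating the file does, which is why truncating "by hand" recovers the space. This illustrates ordinary Unix semantics under that assumption and says nothing about how Sprite handles the same sequence; the function names and the scratch path are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Release a file's disk blocks even if another process still has it open:
 * truncate it through a fresh descriptor, then remove the name.
 * Returns 0 on success, -1 on error.
 */
int truncate_and_remove(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    int rc = ftruncate(fd, 0);      /* blocks are released here */
    close(fd);
    return rc < 0 ? -1 : unlink(path);
}

/*
 * Demonstration: "writer" stands in for the runaway process's descriptor.
 * Returns the file size seen through that descriptor after the cleanup,
 * or -1 on error.
 */
long size_after_cleanup(void)
{
    char path[] = "/tmp/limbo_demo_XXXXXX";    /* hypothetical scratch file */
    char block[4096] = {0};
    struct stat st;

    int writer = mkstemp(path);
    if (writer < 0)
        return -1;
    if (write(writer, block, sizeof block) != (ssize_t)sizeof block
        || truncate_and_remove(path) != 0
        || fstat(writer, &st) != 0) {
        close(writer);
        unlink(path);               /* best-effort cleanup on failure */
        return -1;
    }
    close(writer);
    return (long)st.st_size;
}
```

On a Unix system `size_after_cleanup()` reports a size of zero even though the writer's descriptor is still open at the time of the check, so the space is back before the last reference goes away.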
1507. Date: Fri, 24 Aug 90 04:45:45 PDT From: Fred Douglis <douglis> Subject: allspice/ginger (whining, whining, whinnying) various troubles tonight that i haven't yet reported: - i didn't see anything on allspice's console about where to put a kernel to boot off allspice's local disk. i wrote into / without success, then tried /allspiceA, and i found that i could boot "sd()new" but not "sd()clean" -- if i moved /allspiceA/clean into /allspiceA/new then i could boot my kernel. this was important since ginger was down for dumps and i couldn't boot off ginger. - allspice got level 15 interrupts left & right, and i couldn't get at a copy of kmsg to continue it until the dumps weren't running and i could type at ginger's console. (i didn't want to bring ginger up to full service just in case the dumper had more plans with it). - the bug i reported about printfs in the kernel omitting their arguments, and printing spaces instead, came up quite a bit during the benchmarking. it makes things hard to figure out sometimes. 1508. Date: Fri, 24 Aug 90 08:58:51 PDT From: pmchen (Peter M. Chen) Subject: xbiff doesn't exist in /X11/R4/cmds.ds3100 Would someone please re-install it? (it still exists in /X11/R4/cmds.sun4) 1509. Date: Fri, 24 Aug 90 09:20:54 PDT From: pmchen (Peter M. Chen) Subject: still can't boot Even though ginger is up, we still can't boot our machines (garlic and forgery in particular). 1510. Date: Fri, 24 Aug 90 10:55:29 PDT From: Mike Kupfer <kupfer> Subject: deletion of file doesn't change directory mod time If I delete a file, the modification time for the directory it was in remains unchanged. Is this deliberate? It's not Unix-compatible, and it confuses programs like xmh, which cache information about directories. 1511. Date: Fri, 24 Aug 90 15:56:53 PDT From: ouster (John Ousterhout) Subject: Re: booting I fixed Pete's booting problems, but then forgot to send a message about it (sorry). 
I fixed the problems by restarting portmap, bootp, and tftpd on Allspice. It looked like portmap was in an infinite loop. I tried to put it into the debugger, but "kill -DEBUG" didn't have any effect so I eventually "kill -KILL"ed it. 1512. Date: Fri, 24 Aug 90 16:01:31 PDT From: tve (Thorsten von Eicken) Subject: /user1 full & big lost+found I rm'ed old stuff in /user1/lost+found and got >10 Megs. Maybe some regular cleanup of lost+found would be useful? 1513. Date: Fri, 24 Aug 90 16:03:32 PDT From: Fred Douglis <douglis> Subject: portmap portmap was in an infinite loop yet again when i got to campus this afternoon. i was able to kill -DEBUG it, so i looked at it. turns out the usual "recvfrom -> error" problem was at fault. the sunrpc library always returned that a stream was okay rather than flagging it as bad and destroying it. i tried changing the library and running the uninstalled portmap to make sure it worked okay, but i hadn't installed it as of the time allspice rebooted. i'll do so now and restart portmap. 1514. Date: Fri, 24 Aug 90 17:45:48 PDT From: Fred Douglis <douglis> Subject: fclose(NULL) hits segv tex tries to check the access of a file by doing: ok = fclose(fopen(name_of_file, "w")) == 0; this causes a segmentation violation when fclose tries to flush the stream. the man page for fclose says: These routines return EOF if stream is not associated with an output file, or if buffered data cannot be transferred to that file. 1515. Date: Fri, 24 Aug 90 19:56:00 PDT From: gibson (Garth Gibson) Subject: file corruption I don't often use Sprite these days and I still receive plenty of mail here, so I went in to clean up and found this: garlic 1> mail Warning: encountered nulls at 77185. Mail spool file may be damaged. Does anybody want to look at it? I thought file corruption on Sprite was a thing of the past? 1516. 
Date: Sat, 25 Aug 90 11:12:41 PDT From: tve (Thorsten von Eicken) Subject: portmap on allspice in infinite loop again (I didn't touch it) 1517. Date: Sat, 25 Aug 90 13:27:35 PDT From: Mike Kupfer <kupfer> Subject: bogus stack backtrace on sun4 I was debugging portmap on allspice, and gdb told me that the stack backtrace looked like #0 0x486c in svcerr_noprog (xprt=(SVCXPRT *) 0x1bfffdf0) (svc.c line 325) 325 rply.acpted_rply.ar_verf = xprt->xp_verf; #1 0x4a1c in svc_getreqset (readfds=(struct fd_set *) 0x1bfffdf0) (svc.c line 432) #2 0x4d88 in _svcauth_unix (rqst=(struct svc_req *) 0x485c8, msg=(struct rpc_msg *) 0x186a0) (svc_auth_unix.c line 105) #3 0x2350 in main () (portmap.c line 134) This looks highly suspicious. svc_getreqset calls _authenticate, which should call _svcauth_unix via a jump table. (There's also the matter of why the backtrace doesn't show svc_run, which should be between main() and svc_getreqset.) And of course there's no path from svc_getreqset to svcerr_noprog... I had to do a "kill -ILL" to put portmap into the debugger (portmap was chewing up lots of CPU and "kill -DEBUG" was being ignored). Could that have anything to do with this weirdness (seems unlikely, though)? 1518. Date: Sat, 25 Aug 90 13:32:20 PDT From: Mike Kupfer <kupfer> Subject: mangled file in sunrpc sources /sprite/src/lib/sunrpc/DISCLAIMER is mangled. I assume we chalk it up to the filesystem problems Sprite was having earlier in the summer? More to the point, if this file is something we intend to distribute, we should probably install a clean version. Anyone know where to get a copy? 1519. Date: Sat, 25 Aug 90 15:58:02 PDT From: Fred Douglis <douglis> Subject: Re: rlogin dead on allspice oddly, telnetting to allspice, killing inetd, and restarting it caused rlogin to work again. 1520. Date: Sun, 26 Aug 90 13:18:01 PDT From: tve (Thorsten von Eicken) Subject: allspice down sun 3am - 1pm ... it wanted to sleep even longer than me ... 
It had a lot of "intel" error messages on the console. 1521. Date: Sun, 26 Aug 90 14:43:27 PDT From: gibson@apathy.Berkeley.EDU (Garth Gibson) Subject: background pmakes I've told Fred about this, but I thought I should report directly. I issued 100 simulations through pmake with .BACKGROUND: yesterday. This was done from garlic (ds3100) when there were about 9-10 available ds3100s. The first 20 simulations went wonderfully, I even did a compile while they were running and got full parallelism without trashing the simulations. Somewhere between the 20th and 30th simulation a problem developed. First, of 9 parallel jobs, 4 of them ended up on garlic running in parallel with only 5 jobs migrated to other processors (there were still about 9 available processors). Then the total number of jobs running at once dropped to 6 (still about 9 avail ds3100s) and half of these were running on garlic. When I looked at it today, the pmake was still running and one simulation was not complete. I looked into the pmake output and discovered that the 26th simulation was never issued. 1522. Date: Sun, 26 Aug 90 14:51:23 PDT From: gibson (Garth Gibson) Subject: C printf On the ds3100 my simulations generate NaN and print it as a normal floating point number. On SunOS the string "NaN" is printed. On the ds3100 the string printed is "n(NaN" where n is a single digit between 0 and 9. This is making it quite hard for programs that read my simulations' output. 1523. Date: Sun, 26 Aug 90 15:46:40 PDT From: shirriff (Ken Shirriff) Subject: Nutmeg's monitor died. The image on nutmeg's screen suddenly folded up and came back, making a click noise and emitting a small puff of smoke. A few seconds later the monitor started making buzzing noises so I figured it was best to turn it off. Could the appropriate people get this fixed? 1524.
Date: Sun, 26 Aug 90 15:58:11 PDT From: tve (Thorsten von Eicken) Subject: Fsconsist_RpcConsist I'm getting heaps of "Fsconsist_RpcConsist: <3,133696> delete msg from 14 dropped: no handle" messages on crackle (one for each file I rm). When deleting lots of files, this is a real pain. 1525. Date: Sun, 26 Aug 90 16:04:23 PDT From: Fred Douglis <douglis> Subject: Re: Fsconsist_RpcConsist the theory had been that this happens because lots of handles get used up for pipes in limbo, so as soon as you delete a file your handle for it gets scavenged. then the file server sends you a consistency message because it doesn't know it got scavenged. however, crackle doesn't have lots of handles in this state, so it's not clear that's really the problem. i've fixed the pipe problem (for the next kernel) but i think we should also just nuke the warning message, since it's unnecessarily verbose. if anyone objects to my removing this message, speak up. 1526. Date: Sun, 26 Aug 90 21:48:03 PDT From: gibson (Garth Gibson) Subject: long simulations on ds3100s One of the long simulations I was running in the background under pmake control detected an internal error and aborted. An identical invocation later on (in the foreground, not pmaked) completed with no error. I don't know much about the error; I have a core image - is there anything I can learn from it? I'm concerned because it was not the kind of error that crashes a program, it was an assertion of correct operation. This could either be migration messing up internal state or the magic "ds3100s do weird things sometimes" bug. 1527. Date: Sun, 26 Aug 90 21:55:49 PDT From: rab (Robert A. Bruce) Subject: Re: dumps The dumps are still not working. The last successful daily dump was on Wednesday morning. If you have modified any important files since then, you should make redundant copies. I don't know when we will get everything working again, but right now we don't have any properly functioning tape drives. 1528.
Date: Sun, 26 Aug 90 22:19:47 PDT From: Mike Kupfer <kupfer> Subject: Re: Fsconsist_RpcConsist One argument for retaining the message (perhaps in a less verbose form) is that it's pointing out behavior that we can't explain. If the situation that generates these messages is also responsible for a performance hit, then we want to get notified about it. 1529. Date: Mon, 27 Aug 90 10:56:42 PDT From: Fred Douglis <douglis> Subject: sparcstation 1+ has identity crisis john reported that migrations to boing were hanging. that's because boing didn't shrink its file cache and was thrashing its pmegs. the reason for that is that "hostname -type" prints sun4 (the default) rather than sun4c. it seems that the type for the 1+ needs to be promoted to a full-fledged machine type, since the value in the prom is different from existing sun4 and sun4c values. in the meantime, i'll change boing's bootcmds to run fscmd. 1530. Date: Mon, 27 Aug 90 11:14:27 PDT From: mendel (Mendel Rosenblum) Subject: More on sparcstation 1+ identity crisis The problem is that the machType and machArch aren't well defined for the sun4c series. Currently, the available machine types are: #define SYS_SUN_2_50 0x02 #define SYS_SUN_2_120 0x01 #define SYS_SUN_2_160 0x02 #define SYS_SUN_3_75 0x11 #define SYS_SUN_3_160 0x11 #define SYS_SUN_3_50 0x12 #define SYS_SUN_3_60 0x17 #define SYS_SUN_4_C 0x51 These numbers are returned by the PROM on the different machines. The problem is that 0x51 is the value for the 4/60 (SparcStation 1) while the SparcStation 1+ has a value of 0x53. The other sun4c's like the SLC and the IPC have different numbers. I suggest we nuke the symbol SYS_SUN_4_C and replace it with #define SYS_SUN_ARCH_MASK 0xf0 #define SYS_SUN_IMPL_MASK 0x0f #define SYS_SUN_4C 0x50 #define SYS_SUN_4C_60 0x51 #define SYS_SUN_4C_65 0x53 #define SYS_SUN_4 0x20 #define SYS_SUN_4_200 0x21 Note that 0xf0 is the architecture mask and 0x0f is the implementation mask.
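The effect of splitting the PROM value into an architecture nibble and an implementation nibble can be sketched in C. The `#define` values are the ones quoted in the proposal; the two helper functions and their names are hypothetical illustrations and are not taken from the Sprite sources.

```c
#include <stdbool.h>

/* Values as quoted in the proposal. */
#define SYS_SUN_ARCH_MASK 0xf0
#define SYS_SUN_IMPL_MASK 0x0f
#define SYS_SUN_4C        0x50
#define SYS_SUN_4C_60     0x51
#define SYS_SUN_4C_65     0x53
#define SYS_SUN_4         0x20
#define SYS_SUN_4_200     0x21

/* Hypothetical helper: true for any sun4c, whatever the implementation. */
bool MachTypeIsSun4c(int machType)
{
    return (machType & SYS_SUN_ARCH_MASK) == SYS_SUN_4C;
}

/* Hypothetical helper: implementation number within the architecture. */
int MachTypeImpl(int machType)
{
    return machType & SYS_SUN_IMPL_MASK;
}
```

With this test, both 0x51 (SparcStation 1) and 0x53 (SparcStation 1+) are classified as sun4c, while 0x21 is not, so code that only cares about the architecture no longer needs an entry for every new model.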
Instead of testing for machType == SYS_SUN_4_C we should do (machType & SYS_SUN_ARCH_MASK) == SYS_SUN_4C Mary, most of the tests against machType are in the code that does the frame buffer stuff. It looks like this is the reason why the X server doesn't work correctly on boing. 1531. Subject: tar doesn't understand -C Date: Mon, 27 Aug 90 13:18:24 PDT From: Mike Kupfer <kupfer> "tar cf sprite.tar .cshrc .login .newsrc Todo emacs \ -C /sprite/src/cmds ar/ar.c" gets me "-C: no such file or directory", and tar proceeds to add all of /sprite/src/cmds to the archive file. 1532. Date: Mon, 27 Aug 90 13:18:24 PDT From: Mike Kupfer <kupfer> Subject: tar doesn't understand -C "tar cf sprite.tar .cshrc .login .newsrc Todo emacs \ -C /sprite/src/cmds ar/ar.c" gets me "-C: no such file or directory", and tar proceeds to add all of /sprite/src/cmds to the archive file. 1533. Date: Mon, 27 Aug 90 13:38:58 PDT From: eklee (Edward K. Lee) Subject: prefix crashes raid1 when ... When doing prefix -l <dir> -M <dev> if the unit number of <dev> does not correctly encode the partition on which the filesystem is built, raid1 will sometimes crash. 1534. Date: Mon, 27 Aug 90 15:14:42 PDT From: Fred Douglis <douglis> Subject: allspice behavior explained; disk full problems revisited allspice was behaving VERY POORLY because garlic was trying to flush a file through to a full disk. there are two bugs here. one is that the file server grinds to a halt (programs can't even start up) when a disk fills. the client froze up too. allspice was printing cache messages repeatedly.
the other is that the offending file wasn't behaving as expected:
>>>>> On Mon, 27 Aug 90 14:17:12 PDT, gibson@sprite.Berkeley.EDU (Garth Gibson) said:
>> i was wrong about this morning's runs
>> those last 4 jobs were running and garlic wasn't down
>> one job did abort and dumped core - more than 100 MB
>> this filled the filesystem and appeared to freeze garlic
>> i lost my windows and it was not responding
>> when i deleted the file (from forgery) that was overfilling the
>> disk, the other jobs went on (probably all running on garlic)
>> why does deleting a file that a job is writing appear to truncate it?
>> on Unix, I thought the delete made the file invisible but didn't
>> affect the disk until the job stopped:
>> forgery 10> ls Run/run.90
>> total 94966
>> 5 histo.db 94960 reli.dump 1 reli.trace
>> forgery 11> rm Run/run.90/reli.dump
>> forgery 12> df .
>> Prefix Server KBytes Used Avail % Used
>> /scratch3 allspice 480492 432442 0 100%
>> forgery 13> ls Run/run.90
>> total 19
>> 13 histo.db 3 reli.db 3 run.100.out
>> forgery 14> df .
>> Prefix Server KBytes Used Avail % Used
>> /scratch3 allspice 480492 432442 0 100%
>> forgery 16> du -s Run/run.90
>> 75807 Run/run.90
>> forgery 17> ls -l !%
>> ls -l Run/run.90
>> total 76138
>> 5 -rw-rw-r-- 1 gibson 4868 Aug 27 05:44 histo.db
>> 76132 -rw-rw-r-- 1 gibson 99336192 Aug 27 14:02 reli.dump
>> 1 -rw-rw-r-- 1 gibson 80 Aug 27 05:44 reli.trace
>> forgery 18> rm Run/run.90/reli.dump
>> forgery 19> df .
>> Prefix Server KBytes Used Avail % Used
>> /scratch3 allspice 480492 337502 94940 78%
>> and later
>> garlic 9> ls Run/run.90/reli.dump
>> 10924 Run/run.90/reli.dump
my understanding is that removing reli.dump (11) would make the file invisible, as garth expected, and that garth would be totally unable to delete it to free the space. (what i do to delete a file that a process is actively writing is copy /dev/null onto it first to truncate it.)
the change in size that garth describes sounds like a nasty bug relating to caching behavior when the disk fills. 1535. Date: Mon, 27 Aug 90 17:06:07 PDT From: Fred Douglis <douglis> Subject: allspice consistency timeouts allspice was hanging lots of stuff. seems we have a big problem here: a backtrace showed that the process that was waiting for /tmp/ctmxxxx to get consistency replies had /tmp locked, so the guy who had / locked was blocked waiting for /tmp, and everyone else was blocked waiting on /. we have to make sure that the parent isn't locked during the consistency callback! 1536. Date: Tue, 28 Aug 90 11:42:49 PDT From: Fred Douglis <douglis> Subject: abort() incompatibility w/ unix ------- Forwarded Message Date: Tue, 28 Aug 90 11:42:18 -0700 From: gibson@apathy.Berkeley.EDU (Garth Gibson) To: douglis@sprite.Berkeley.EDU Subject: Re: pmake I have a signal handler for SIGQUIT that is intended to be used to invoke a dump. It seems as if calling abort() induced a SIGQUIT signal. Does this make sense? ------- End of Forwarded Message sure enough, the man page for abort() implies that it terminates the process with an illegal instruction, but sprite's abort sends SIG_DEBUG. shouldn't sprite send SIGILL and let that cause it to enter the debugger if that's the appropriate action? 1537. Date: Tue, 28 Aug 90 11:51:29 PDT From: Mike Kupfer <kupfer> Subject: idraw yields MachUNIXGetDirEntries error [This is a followup to bugs 29851 and 29902.] I merged the readv/writev fixes into the ds3100 kernel, but it didn't fix the problems with idraw. Problem #1: when starting idraw, there are some error messages "MachUNIXGetDirEntries: Bad directory format". Problem #2: idraw then tries to contact servers for /*. [David Culler has already reported both of these.] Problem #3: if I try to save a new drawing, idraw munges the path name ("/user2/kupfer/tmp/idraw.test" becomes "/swap1/kupfer/tmp/idraw.test"). 
It also generates a few more "MachUNIXGetDirEntries: Bad directory format" messages. The file doesn't get saved. 1538. Date: Tue, 28 Aug 90 11:55:15 PDT From: Fred Douglis <douglis> Subject: Re: idraw yields MachUNIXGetDirEntries error could the "bad directory format" problems be due to byte-swapping? i'll bet our fixes for byte swapping directories are in the user-level directory routines linked into the program via libc. kinda kills unix compatibility. 1539. Date: Tue, 28 Aug 90 13:06:45 PDT From: Fred Douglis <douglis> Subject: kernel stack limit garth's been running into trouble with hosts running out of processes because he runs a big pmake with several processes per task. as garth points out: the number of processes per machine need reflect the total processing power of all machines (of one type) not what you'd expect from a single machine so what can we do about it?? 1540. Date: Tue, 28 Aug 90 13:28:10 PDT From: shirriff (Ken Shirriff) Subject: Migration race in Proc_MigReceiveProcess Cardamom (ds3100) suffered the following death: Proc_MigReceiveProcess called Fs_DeencapFileState, which set up procPtr->fsPtr. It then called Fs_Open, but got FS_STALE_HANDLE. Then it called Fs_CloseState to clean up, but procPtr->fsPtr was now NIL. My guess is that something else cleared fsPtr before Fs_CloseState got it. 1541. Date: Tue, 28 Aug 90 15:30:31 PDT From: Mike Kupfer <kupfer> Subject: Caps Lock on Sparcstation My Caps Lock key seems to act as a shift key. That is, if I hold it down while typing, I get capital letters, but if I release it, I get lower case. Is this behavior deliberate? (I realize that most people seem to intensely dislike the Caps Lock key.) 1542. Date: Tue, 28 Aug 90 15:40:50 PDT From: tve (Thorsten von Eicken) Subject: can't boot I can't boot our sun3/60 with "be le(0,961c,43)sun3.new" anymore. I get a "tftp: file not found @ block 1" error. 1543. 
Date: Wed, 29 Aug 90 06:15:25 PDT From: Fred Douglis <douglis> Subject: allspice crash/disk state the benchmarking session went fine, but then when i rebooted allspice with 'new' again to go home, it crashed with the same old cache writeback bug (it started migd when the machine running migd didn't recover with it quickly enough). when it came back its disk was pretty frazzled, judging by the messages. be on the lookout.. 1544. Date: Wed, 29 Aug 90 11:08:06 PDT From: mendel (Mendel Rosenblum) Subject: SigMigSend() is called with wrong arguments in DeferSignal() The routine SigMigSend() is called with the wrong number of arguments in routine DeferSignal() in the Sig module. To get it to compile with function prototypes I corrected the number of arguments. The missing argument is the faulting address of the signal (I set it to 0). This should probably be fixed to use the actual faulting address but it wasn't handy in the routine. I believe this means that migrated processes will get a bogus address for some signals. Anybody want to fix this? 1545. Date: Wed, 29 Aug 90 17:02:01 PDT From: Fred Douglis <douglis> Subject: crontab bug i asked joel why he was sending so many messages to himself in rapid succession. his response: ------- Forwarded Message Date: Wed, 29 Aug 90 16:59:41 -0700 From: joel@sprite.Berkeley.EDU (Joel A. Fine) To: douglis@sprite.Berkeley.EDU cc: rab@sprite.Berkeley.EDU Subject: Re: mail queue Actually, it looks like crontab has gone a little haywire. In an attempt to TEST crontab, I entered a line which looks like the following: * * * * * joel csh -c 'echo xxx | Mail -s "test" joel' According to normal crontab conventions, this is supposed to execute once every minute. Evidently, crontab is executing it as fast as my little cpu can handle it. This is a bug either in crontab or sprite or somewhere in between. I think this is why allspice went down a couple of times today. I'm sorry about the inconvenience that this caused.
I've commented out the offending line (in /hosts/heresy/crontab, in case you're interested). - - Joel Fine ------- End of Forwarded Message 1546. Date: Wed, 29 Aug 90 18:27:58 PDT From: rab (Robert A. Bruce) Subject: Re: crontab bug The problem is that our library sleep() routine is too simple.
    int
    sleep(seconds)
        int seconds;
    {
        struct timeval tv;
        tv.tv_sec = seconds;
        tv.tv_usec = 0;
        (void) select(0, (int *) 0, (int *) 0, (int *) 0, &tv);
        return 0;
    }
This doesn't work if select is interrupted before the timeout occurs. I replaced /sprite/src/lib/c/etc/sleep.c with the BSD version from monet. I relinked cron with the new sleep() and the bug disappeared. I don't know why select is getting interrupted so often. 1547. Date: Wed, 29 Aug 90 18:36:03 PDT From: shirriff (Ken Shirriff) Subject: include loop I'm having trouble getting the include files for the vm module to work with prototypes. The problem is the include files have a cycle, because of an extra include file I needed for the arguments. A simple example:
    vm.h:
        #ifndef _VM
        #include "proc.h"
        extern int foo _ARGS_((procType param));
        typedef int vmType;
    proc.h:
        #ifndef _PROC
        #include "vm.h"
        vmType bar;
        typedef int procType;
The problem is that the invocation of "vm.h" in proc.h is null, because include files only get included once. Thus the use of vmType in proc.h comes before vmType gets defined in vm.h. Complicating this, the loop isn't actually this simple; it actually goes through about 5 include files. This problem must have occurred before; how do I solve it? 1548. Date: Wed, 29 Aug 90 19:02:24 PDT From: Mike Kupfer <kupfer> Subject: Re: include loop One possibility is to edit vm.h so that the typedef's come before proc.h is included. This problem comes up fairly frequently in Mesa. The typical solution is to separate the definitions file into pieces. In this case we'd have a vmTypes.h and a vm.h. The basic types definitions (structs, typedefs) go into vmTypes, the rest goes into vm.h.
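A rough sketch of the types/declarations split, shown as one compilable file with comments marking where each header would begin. The file names vmTypes.h and procTypes.h follow the suggestion in this message; the include guards, the plain ANSI prototype (instead of the _ARGS_ macro), and the bodies of foo and bar are assumptions made only for illustration.

```c
/* ---- vmTypes.h: basic vm types only; includes nothing ---- */
#ifndef _VMTYPES
#define _VMTYPES
typedef int vmType;
#endif /* _VMTYPES */

/* ---- procTypes.h: basic proc types only; includes nothing ---- */
#ifndef _PROCTYPES
#define _PROCTYPES
typedef int procType;
#endif /* _PROCTYPES */

/* ---- vm.h: would #include "vmTypes.h" and "procTypes.h" ---- */
#ifndef _VM
#define _VM
extern int foo(procType param);
#endif /* _VM */

/* ---- proc.h: would #include "procTypes.h" and "vmTypes.h" ---- */
#ifndef _PROC
#define _PROC
extern vmType bar;
#endif /* _PROC */

/* ---- definitions, as they might appear in the .c files ---- */
vmType bar = 0;

int foo(procType param)
{
    return param + bar;
}
```

Because the two types headers include nothing, vm.h and proc.h never need to include each other, and the cycle disappears regardless of which header a source file pulls in first.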
You could also have a procTypes.h. How you split it up is partly a matter of taste, partly a matter of pragmatics (i.e., what does it take to get the sucker to compile). 1549. Date: Thu, 30 Aug 90 01:30:54 PDT From: rab (Robert A. Bruce) Subject: #@!%*&^# tapedrive (whining, cursing, etc.) The tapedrive situation is going from bad to worse. Brian and I put allspice's tapedrive on envy, and we got the same write errors that we had on Sprite (media error followed by file mark error). I figured this was a sign that it was a hardware problem, and not a Sprite software problem. Brian loaned us a working tape drive, so we could continue with our dumps until our drive is repaired. But tonight, on the very first file, we got the same write errors on the new drive. Brian said that they have used this drive a lot and have never had a write error. I tried several different tapes, and got write errors on all of them. In the meantime, I changed the dump script to put everything on murder's test disk. The tape drive we sent to be repaired is due back on Friday. It is supposed to have a new prom and a new read/write board. If we are lucky, that will fix the problem. 1550. Date: Thu, 30 Aug 90 10:34:07 PDT From: mendel (Mendel Rosenblum) Subject: filesystem deadlock on allspice Just for the record, allspice hung up yesterday with the following problem: The basic problem was that all the RPC servers were waiting on file handle locks. The root of this pile up was a directory that was locked during name lookup of a delete. Note that the parent directory remains locked during the entire delete process including the descriptor sync to disk and the consist callbacks. Unfortunately, the callback to heresy for this file was returning FAILURE because heresy was trying to open the file. Upon getting a FAILURE the file server retries the consist RPC (busy waiting with RPCs over the network). Every 30 of these it prints a message.
Heresy was rejecting this message because it thought that it had an outstanding open request for the file. Given the locked up state of the file server, the open would probably never finish until the consist finished but the consist was waiting for the open to finish. So here is a guess at the problem: Heresy opens, writes, and closes a file "/foo". Someone else tries to delete /foo. This locks / and causes a callback to heresy. Before the callback arrives, heresy tries to open the file. Since "/" is locked it blocks on this lock. This causes the consist callback for heresy to return FAILURE and be retried. 1551. Date: Thu, 30 Aug 90 10:48:45 PDT From: ouster (John Ousterhout) Subject: Re: filesystem deadlock on allspice It seems to me that a "delete" operation should only lock the parent directory long enough to remove the child from it. Once the child name has been removed, then the parent directory can be unlocked while the file descriptor is cleaned up and consistency callbacks are made. Wouldn't this prevent the deadlock? Moreover, is there any need for consistency callbacks on a delete? If the file is open then its disk space is untouched (so caches needn't be flushed), and if the file is closed there's no need to call back because the caches will be flushed automatically on the next open. Given our past experience with bugs on file servers, I predict that once this problem starts happening (e.g. because the overall load on Allspice has increased recently) it's going to happen more and more often until we fix it. If this is the case, then we ought to start fixing it ASAP. Otherwise the system is going to become so unstable that it will be hard to keep it up long enough to fix the problem (remember last spring?). Is there a volunteer to take a closer look? In general, it seems to me that no directories should ever be locked while any consistency callbacks of any sort are made; otherwise there will be a deadlock potential between an open and a callback.
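The ordering suggested here (unlink the name under the parent lock, release the lock, then call back) can be illustrated with a toy, single-threaded model. Everything in it is hypothetical: Dir, DirLock, DirUnlock, ConsistCallback and DeleteFile are stand-ins invented for the sketch, the lock is just a flag, and locking of the child handle is omitted. It is not the Sprite file system code.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 8

/* Toy directory: a lock flag plus a small table of entry names. */
typedef struct Dir {
    int locked;
    const char *names[MAX_ENTRIES];
} Dir;

/* Records whether the parent was locked when the callback ran (-1 = never ran). */
int parentLockedDuringCallback = -1;

void DirLock(Dir *d)   { d->locked = 1; }
void DirUnlock(Dir *d) { d->locked = 0; }

/* Stand-in for the consistency callback made to clients. */
void ConsistCallback(const Dir *parent)
{
    parentLockedDuringCallback = parent->locked;
}

/*
 * Remove `name` from `parent` while holding the parent lock, then drop
 * the lock before making the callback. Returns 1 if the name was found.
 */
int DeleteFile(Dir *parent, const char *name)
{
    int i;
    int found = 0;

    DirLock(parent);
    for (i = 0; i < MAX_ENTRIES; i++) {
        if (parent->names[i] != NULL && strcmp(parent->names[i], name) == 0) {
            parent->names[i] = NULL;
            found = 1;
            break;
        }
    }
    DirUnlock(parent);              /* parent is free again from here on */

    if (found) {
        ConsistCallback(parent);    /* slow work happens without the lock */
    }
    return found;
}
```

In this arrangement an open that arrives during the callback can take the parent lock, fail the lookup because the name is already gone, and return, instead of blocking behind the delete.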
The only reason for holding a parent directory locked is (a) to ensure that the child file doesn't go away before it is locked, and (b) to synchronize directory updates. Once the child handle is locked, there shouldn't be any need to keep the parent directory locked. I don't know how major of a change is required to implement this... 1552. Date: Thu, 30 Aug 90 13:25:05 PDT From: jhh (John H. Hartman) Subject: allspice reboot Allspice was rebooted an hour ago in an attempt to get it running a kernel with the additional lock debugging information. This did not work, because the kernel could not be loaded from disk. Fred has reported this bug already. I will try to reboot allspice again, this time from ginger. I'll also look into the problems with the boot program. 1553. Date: Thu, 30 Aug 90 13:30:25 PDT From: mendel (Mendel Rosenblum) Subject: Migration killing sprite The script in /sprite/src/kernel/fs.mendel.all/doit when executed with the arguments of mkmf (i.e., doit mkmf) kills the client it's running on with a seg fault in the kernel, kills one or more of the client machines of the same type, and causes the consist hangup on allspice. Or at least it did the last three times I ran it. 1554. Date: Thu, 30 Aug 90 15:24:13 PDT From: Mike Kupfer <kupfer> Subject: no profiled libc for ds3100? There's no /sprite/lib/ds3100.md/libc_p.a. Is this accidental or deliberate? We also seem to be missing the ds3100 sources for modf(). 1555. Date: Thu, 30 Aug 90 16:34:41 PDT From: Fred Douglis <douglis> Subject: migd/rpc bug treason was running the migration daemon. after recovery with allspice my machine hung. it turned out i wasn't hung up on allspice but had several rpc's hung to treason. "rpcstat -srvr" on treason showed every rpc daemon in the busy state. i tried killing migd to clear things up but it wouldn't die -- it must have been waiting on an rpc itself. an "l1-i" to list what things were waiting on caused treason to go into the debugger with "current process is nil".
i really can't debug it now. 1556. Date: Fri, 31 Aug 90 13:07:46 PDT From: shirriff (Ken Shirriff) Subject: Strange consistency problem Violence got rpc timeouts with allspice for consistency sync, but allspice hadn't crashed. These timeouts only affected the rn I was running, which totally locked up and couldn't be killed; everything else worked fine on violence. After 1/2 hour, it hadn't unlocked itself so I tried debugging violence, but I couldn't get a decent stack trace for the rn process. 1557. Date: Fri, 31 Aug 90 16:20:40 PDT From: Fred Douglis <douglis> Subject: Reli crashed sun4cs treason & larceny died with the following: "Floating point exception with bad trap code, fsr = 0x%x\n", machStatePtr->trapRegs->fsr); (gdb) p/x machStatePtr->trapRegs->fsr %4 = 0x00068670 1558. Date: Fri, 31 Aug 90 16:22:37 PDT From: Mike Kupfer <kupfer> Subject: name/type clashes with system calls John H. and I just ran into the following problem. The kernel contains some routines from the user "net" library (-lnet). These routines #include both the kernel net.h and the user net.h, each of which declares Net_InstallRoute. Unfortunately, John's kernel version of Net_InstallRoute takes different arguments than the user version, so the compiler complains about conflicting type errors. This seems like a general, potentially hairy problem, caused by having (library) code that runs either in the kernel or in user mode. Our current solution is to "#ifndef KERNEL" out the Net_InstallRoute declaration in the user net.h. (Similarly, the kernel net.h should use "#ifdef KERNEL".) This seems pretty kludgy, though. We thought about forcing the two Net_InstallRoutes to take the same parameters, but that seems untenable in general. It would require that every system call stub take the same arguments as the internal version of the system call. 
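The "#ifndef KERNEL" workaround described above might look roughly like the following. Demo_InstallRoute is a hypothetical stand-in for Net_InstallRoute, and both parameter lists are invented, since the message does not give the real ones.

```c
/* ---- user-level net.h (sketch) ---- */
#ifndef KERNEL
extern int Demo_InstallRoute(int hostID, int flags);
#endif

/* ---- kernel net.h (sketch) ---- */
#ifdef KERNEL
extern int Demo_InstallRoute(int hostID, int flags, void *routeInfo);
#endif

/* ---- shared source, compiled once for user mode and once for the kernel ---- */
#ifdef KERNEL
int Demo_InstallRoute(int hostID, int flags, void *routeInfo)
{
    (void) flags;
    return (hostID >= 0 && routeInfo != 0) ? 0 : -1;
}
#else
int Demo_InstallRoute(int hostID, int flags)
{
    (void) flags;
    return (hostID >= 0) ? 0 : -1;
}
#endif
```

Only one declaration is visible in any given compilation, so the conflicting-types error goes away, at the cost of scattering KERNEL tests through headers that are shared between the two worlds.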
Another thought is to use a different name for the internal version of the system call (e.g., the user calls routine Foo, which traps into FooStub, which then calls FooImpl). Either the library routine would have to have an "#ifdef KERNEL" so that it calls the right name, or we could use a cpp macro (e.g., in the kernel net.h) to fudge the name. Of course, this is pretty ugly, too. Anyone have ideas for a general solution? 1559. Date: Fri, 31 Aug 90 22:51:02 PDT From: Mike Kupfer <kupfer> Subject: fscheck complaints on allspice The past couple times I've been near allspice's console when it's booting, I've noticed complaints about .fscheck.out not being big enough. The message says something like "427 > 0", whatever that means. 1560. Date: Sat, 01 Sep 90 20:56:07 PDT From: Fred Douglis <douglis> Subject: blackmail bugs 1) blackmail started the global migd and was screwing things up, still. it doesn't seem to use the right files, or something. i added a "global-migd-prohibited" file for blackmail (and for all the sun3s while i was at it). 2) at least two symm binaries are not installed setuid when they should be. one of them is su, which means i have to su on another machine to chmod things, kill root processes, etc. the other is the migd binary. 1561. Date: Sat, 01 Sep 90 21:10:42 PDT From: Fred Douglis <douglis> Subject: more bugs for blackmail setuid doesn't seem to work, period. changing a process to be setuid, owned by root, doesn't get it to run as root. also, when i tried to recompile migd, it failed because (1) symm.md/md.mk referred to sym.md instead, and (2) mkmf wouldn't work because everything in symm.md was owned by fubar and not writable. i give up on blackmail. i'm removing its migd binary. 1562. Date: Sun, 2 Sep 90 17:23:13 PDT From: tve (Thorsten von Eicken) Subject: bogus ps output [crackle slides] ps -au | head Couldn't find migrated pid "24c31": the operation was successful.
USER PID %CPU %MEM SIZE RSS STATE TIME PR COMMAND
tve 2371c 19078.1 0.0 76 0 READY 0:02 mss_syll -f ...
tve 43782 16442.5 4.6 776 752 READY 0:21 pmake
tve d3731 7703.1 2.4 404 392 READY 0:01 mss_syll -f ...
tve 373a 7045.3 7.8 3204 1272 RWAIT 22:36 X :0
tve c3722 5003.9 2.5 420 408 RWAIT 0:00 mss_deroff -f ...
tve a3745 2916.5 3.0 1004 488 RWAIT 0:32 xterm
tve 53741 2187.5 1.4 248 236 READY 0:00 mss_syll -f ...
tve 8373e 1875.0 1.8 308 296 READY 0:00 mss_deroff -f ...
tve a3746 1569.3 2.2 840 356 RWAIT 0:26 xterm
1563. Date: Sun, 2 Sep 90 23:07:39 PDT From: tve (Thorsten von Eicken) Subject: fsmakedev -p doesn't work correctly
    crackle-1# cd /dev
    crackle-2# rm audio
    crackle-3# fsmakedev -d 15 -p 666 audio
    crackle-4# ls -l audio
    c-w--wx--- 1 root 15, 0 Sep 2 23:06 audio*
    crackle-5# chmod 666 audio
    crackle-6# ls -l audio
    crw-rw-rw- 1 root 15, 0 Sep 2 23:06 audio
    crackle-7# exit
1564. Date: Mon, 03 Sep 90 14:15:54 PDT From: Fred Douglis <douglis> Subject: lost+found messages i am getting both empty messages ("you have files.." but no directories listed) and duplicate messages. 1565. Date: Tue, 04 Sep 90 10:48:35 PDT From: Mike Kupfer <kupfer> Subject: ANSI generic pointer type I think we're slightly ANSI incompatible, in that we typedef Address to be "char *" (when it should be "void *"). This isn't that big a deal, but we could conceivably get complaints when users compile their ANSI-compliant programs under Sprite. 1566. Date: Tue, 4 Sep 1990 12:53:05 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: migration problems I ran a pmake and got the following errors:
    Rpc_Call, bad serverID <-1>
    Warning: SigMigSend:Error trying to signal 11 to process 11f45 (ffffffff on host -1): an argument to a call was invalid
    Warning: SigMigSend:Error trying to signal 11 to process 51f6b (e144c on host 20): the specified process's user ID does not match the current process's uid
1567.
Date: Tue, 4 Sep 90 16:03:33 PDT From: elm (ethan miller) Subject: bug in ps on sun4c For some reason, ps -au prints unrealistic (>2000%) numbers for CPU. This happens for quite a few processes, with the numbers starting above 2000 and dropping slowly down to 0.0. 1568. Date: Tue, 04 Sep 90 16:37:04 PDT From: Fred Douglis <douglis> Subject: Re: bug in ps on sun4c it turns out the problem is in the recent change to distinguish multiple machine types. i've installed a new ps that should fix the problem. 1569. Date: Wed, 5 Sep 90 11:29:29 PDT From: pmchen (Peter M. Chen) Subject: vagrancy crashed vagrancy crashed about 3 days ago. The error was Proc_RpcRemoteCall: unparsed call 1 returned 0 Entering debugger with a Breakpoint trap exception at PC 0x80094d6c. 1570. Date: Wed, 5 Sep 90 13:21:47 PDT From: shirriff (Ken Shirriff) Subject: allspice/oregano consistency Allspice is printing a bunch of messages: ClientCommand, write-back msg to client 38 file "rawstat.11:06:15.Z" <5,41218> failed 40012 Client state killed: 0 refs 0 write 0 exec Oregano is printing a bunch of messages: FsConsist_RpcConsist <5,41218> Writeback message from 14 dropped: no handle Which kernel has the changes to prevent the writeback errors? 1571. Date: Thu, 6 Sep 90 14:10:39 PDT From: shirriff (Ken Shirriff) Subject: ntalkd loop Ntalkd on allspice went insane and kept printing: <28>Sep 6 13:53:02 talkd[e0e47]: recv: stale remote file handle Each time I killed the ntalkd, a new one would start. The only way I could stop it was by removing /sprite/daemons/ntalkd. It was very difficult to fix the problem because the console screen was filled up with scrolling error messages. Suggestions: 1. Don't log messages to the screen, since they are saved in a file anyways. 2. Have some way of disabling syslog messages to the screen. 3. Pass the syslog messages to a filter which will slow the rate to something reasonable. 1572. Date: Thu, 6 Sep 90 15:58:05 PDT From: dingle (Adam T. Dingle) Subject: two bugs 1. 
On the Sun 3, if the MACHINE environment variable is set to the name of a machine which the C compiler does not recognize, the compiler will dump core. (This does not happen on the Sun 4.) Example:
    % ls
    Makefile linkdata.c tags.h
    % setenv MACHINE vax
    % cc linkdata.c
    Segmentation violation
    %
2. On the Sun 3 (and possibly on other machines), if a command (such as the preceding cc command) dumps core, pmake will terminate quietly, without any indication that the command did not execute properly. Example:
    % ls
    Makefile linkdata.c tags.h
    % setenv MACHINE vax
    % make
    cc -Dvax -o linkdata linkdata.c
    % ls
    Makefile linkdata.c tags.h
    %
1573. Date: Thu, 06 Sep 90 16:04:09 PDT From: Fred Douglis <douglis> Subject: Re: two bugs you don't ever, ever want to reset your MACHINE environment variable. too bad this can't really be enforced, though. maybe there could be a special mechanism for getting the value of MACHINE from the kernel instead of the normal environment. as for pmake terminating quietly, this is true on all machines. if you run pmake and it executes a process locally that hits a segmentation fault, you'll get a message on /dev/syslog saying your process went into the debugger. if you run remotely, the process gets killed, which we believed was better than going into the debugger quietly. sprite needs to (1) support controlling ttys, so you can see messages wherever you are, and/or (2) get rid of the "debug" state. 1574. Date: Thu, 6 Sep 90 17:11:01 PDT From: pmchen (Peter M. Chen) Subject: vagrancy won't boot. No response after I type boot -f tftp()new 1575. Date: Fri, 07 Sep 90 13:43:07 PDT From: Mike Kupfer <kupfer> Subject: arguments to signal handler Adam (Dingle, I assume) pointed out to me that signal.h talks about how the sigcontext struct is "made available to the handler to allow it to properly restore state if a non-standard exit is performed."
However, at no place in signal.h or any of the expected man files is there a description of just what arguments are passed to the handler (or in what order). 1576. Date: Fri, 7 Sep 90 15:43:37 PDT From: bmiller (Bob Miller) Subject: printer problem I seem to be having a problem printing on lw533. In addition to my job, Fred's got one out there, too. 'lpq' shows 'waiting for queue to be enabled on shallot'. HELP!!!!!!!!!!!!! 1577. Date: Sat, 8 Sep 90 10:57:16 PDT From: dingle (Adam T. Dingle) Subject: can't execute program loaded with -n on sun 3 On the Sun 3, I can't seem to execute programs which I load with the -n option (to make their code segments sharable). Example:
    % cat foo.c
    main() { printf("hello, world\n"); }
    % cc -n foo.c
    % a.out
    a.out: permission denied.
    %
The console reads: Proc_Exec: Can't run sun3 NMAGIC executable file on sun3. Any suggestions? 1578. Date: Sat, 8 Sep 90 11:43:48 PDT From: dingle (Adam T. Dingle) Subject: UNIX-domain sockets under Sprite Are UNIX-domain sockets (i.e. those created for protocol family PF_UNIX) supported under Sprite? If so, where is the sockaddr_un structure, used for specifying UNIX-domain addresses, defined? I can't seem to find it in any of the .h files in /usr/include or /usr/include/sys. P.S. Is there an e-mail address for questions (as opposed to bugs) about Sprite? 1579. Date: Sun, 9 Sep 90 14:41:58 PDT From: mendel (Mendel Rosenblum) Subject: Short read()s not Unix compatible The man page for the read() system call states: Upon successful completion, read and readv return the number of bytes actually read and placed in the buffer. The system guarantees to read the number of bytes requested if the descriptor references a normal file that has that many bytes left before the end-of-file, but in no other case. This guarantee does not hold in the Sprite file system when the file cache fills with dirty blocks.
The problem is that Fscache_Read() returns FS_WOULD_BLOCK if it can't fetch a cache block because the cache is full of dirty blocks. If the block fetch that fails is not the first block of the read it will also return the number of bytes read. The main read loop in Fs_Read() changes the status from FS_WOULD_BLOCK to SUCCESS if bytes are returned. This causes the read to return before it has reached end of file. Code that looks like:
    if (read(fd, buf, fileSize) != fileSize) {
        panic("...");
    }
fails on Sprite but works on Unix. This problem appears to be similar to the problems in the Fs_Write loop that John Hartman had with device writes. John, do you think your fix will work for the Read code? 1580. Date: Sun, 9 Sep 90 16:39:57 PDT From: mendel (Mendel Rosenblum) Subject: allspice crashed Allspice deadlocked /tmp today. The problem was that a handle for a file in /tmp was locked by an idle Proc_ServerProc. The tracing of PCs on monitor locks didn't help because file handles have their own locking and lock tracing mechanisms. Also, it took over 25 minutes for allspice to complete recovery with the machines in 477 evans. 1581. Date: Mon, 10 Sep 90 05:32:34 PDT From: rab (Robert A. Bruce) Subject: finger Finger was getting a segmentation violation. It looks like /sprite/admin/userLog is corrupted. I moved it to userLog.bad, and now finger works again. 1582. Date: Mon, 10 Sep 90 18:20:42 PDT From: mendel (Mendel Rosenblum) Subject: typedef Address must be "char *" I changed the typedef of Address back to a "char *" from a "void *". The comment on Address says:
    /*
     * An address is just a pointer in C. It is defined as a character pointer
     * so that address arithmetic will work properly, a byte at a time.
     */
and you can't do address arithmetic on a "void *". 1583. Date: Mon, 10 Sep 90 18:21:56 PDT From: shirriff (Ken Shirriff) Subject: tx clear doesn't always work If I'm rlogged in, "clear" doesn't work about 1/10 of the time.
Sometimes it takes up to 3 clears before the screen actually gets cleared. (The particular case is rlogged in to sage (sun4) from violence (ds3100).) 1584. Date: Tue, 11 Sep 1990 13:54:03 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: fsmakedev fixed Fsmakedev has been changed so that the -p option expects octal numbers, rather than decimal integers. This fixes bug 1563 reported by tve. 1585. Date: Tue, 11 Sep 90 15:53:10 PDT From: douglis (Fred Douglis) Subject: swap space as part of the disk space reorganization, we must move /swap1 to a larger disk. it's full right now just from normal accumulation of processes (not huge simulations or anything). 1586. Date: Tue, 11 Sep 1990 16:51:40 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: more on swap Seeing as how we don't have too many large disks why don't we just create a /swap2 and move some of the machines there? Our disk re-organization will be freeing up a number of disks. Is there some reason why multiple swap disks are bad? 1587. Date: Tue, 11 Sep 90 16:04:43 PDT From: mendel (Mendel Rosenblum) Subject: ds3100 compiler bug? When compiling the timer module (with -O) for the ds3100 you get the message:
    --- ds3100.md/timerTick.o ---
    uopt: Warning: TimerTicksInit line 61: multiplication overflow
Line 61 looks like:
    TimerTicksInit()
    {
        timer_IntOneMillisecond = 1000;
        timer_IntOneSecond = ONE_MILLION;
        timer_IntZeroSeconds = 0;
        timer_IntOneMinute = timer_IntOneSecond * 60;
        timer_IntOneHour = timer_IntOneSecond * 3600; /* line 61 */
        bzero((Address)&timer_TicksZeroSeconds, sizeof(timer_TicksZeroSeconds));
    }
The timer_ variables are declared as unsigned ints. It appears to generate the correct code. 1588. Date: Tue, 11 Sep 90 16:57:52 PDT From: mendel (Mendel Rosenblum) Subject: problem between utils and dev module The utils and dev modules disagree on the arguments to Dev_RegisterConsoleCmd().
The dev module thinks it wants a pointer to a function returning void and taking no argument but the utils module is passing it a pointer to a function returning void and taking a clientData as an argument. 1589. Date: Tue, 11 Sep 90 17:16:07 PDT From: Mike Kupfer <kupfer> Subject: Re: problem between utils and dev module The utils module has it right. See Dev_RegisterConsoleCmd. 1590. Date: Wed, 12 Sep 90 15:15:22 PDT From: rab (Robert A. Bruce) Subject: oregano out of memory Oregano ran out of memory again. Here are the memory trace stats:
(gdb)
(gdb) print Mem_PrintStatsInt()
Total allocs = 21574743, frees = 21484353
Small object allocator:
  Size    Total     Allocs   In Use
    24    48716    4085039    48716
    32    18508    1974779    17940
    40     9340    5005734     6719
    48     3676    1992688     3270
    56     2716    1306982     2579
    64      748     177914      173
    72     5660    1577037     5315
    80      204      18918        2
    88      988     836129      866
    96       92      17021       62
   104       60        422       52
   112        4        533        1
   120       12      11532        2
   128       28     661442        2
   136     2460     936209     2151
   144      124     894320       46
   152        4         95        0
   160       12         20        6
   168        4         35        0
   176       28       1252       15
   184        4       3679        2
   192        4       9846        1
   200        4       2400        0
   208        4       4200        0
   216       92       9689       82
   224        4        846        0
   232        4        500        0
   240        4        107        0
   248       12         56        3
   256       12         75        0
   264        4       1126        0
   280     1116      20053      126
   328     2460     915081     2152
  1536        4     611103        0
  4112       92      53106       80
 Total    97204   21129968    90363
Bytes allocated = 4940032, freed = 752664
Large object allocator:
Total bytes managed: 1326440
Bytes in use: 495040
Orig. Size    Num   Free   In Use
      1016      1      0        1
      2576      2      0        2
       336      1      0        1
       152      2      2        0
       528      5      1        4
       400      3      0        3
      1040     34     33        1
        64      1      1        0
       272      5      4        1
       768      1      1        0
       464      2      2        0
       424      2      0        2
      1048     26     26        0
        80      1      1        0
      1080      1      1        0
       232      1      1        0
     40976      2      0        2
      5912      2      0        2
        16      4      4        0
     12304      3      0        3
      1008      1      1        0
       352      4      4        0
       632      1      1        0
       992      1      1        0
       792      2      2        0
       320      3      3        0
       920      1      1        0
       784      1      0        1
       264      1      1        0
       520      1      1        0
       576      1      1        0
       344      1      1        0
      1528      1      1        0
       144      1      1        0
     49168      1      0        1
       312      2      2        0
       224      1      1        0
       304      1      1        0
The kernel crashed with ``Vm_RawAlloc out of memory'' while it was executing Net_InstallRoute(). 1591. Date: Wed, 12 Sep 90 15:16:41 PDT From: rab (Robert A.
Bruce) Subject: new kernel
The new kernel doesn't work on sun3's. JohnH says it is because of a compiler bug.

1592. Date: Thu, 13 Sep 90 14:39:08 PDT From: Fred Douglis <douglis> Subject: complaint about BUFSIZ
i tried making a change to proc and recompiling. it gets an error message because BUFSIZ is defined in both file.h and stdio.h. i take it stdio.h didn't used to be included by kernel files. does anyone know enough about this stuff to know if
    #ifdef KERNEL
    #ifndef NULL
    #define NULL 0
    #endif
    #define BUFSIZ 4096
    #define const
    #else
    #include <sprite.h>
    #include <stdio.h>
    #include <stdlib.h>
    #endif
can be replaced by
    #include <sprite.h>
    #include <stdio.h>
    #include <stdlib.h>
??

1593. Date: Fri, 14 Sep 90 18:02:42 PDT From: Fred Douglis <douglis> Subject: ranlib broken for ds3100???
i tried making a new libc debug library but got compiler errors when trying to use it. it seems that ranlib is now a no-op. actually, it seems to hit a TLB fault, which looks like a no-op when run under migration. what gives?

1594. Date: Sun, 16 Sep 90 23:42:30 PDT From: Mike Kupfer <kupfer> Subject: telnet doesn't register user
If you telnet into sage, your login doesn't seem to get registered. Certainly "finger" has no record of it.

1595. Date: Mon, 17 Sep 90 18:32:24 PDT From: shirriff (Ken Shirriff) Subject: Compiler bug.
libc/sun3.md/stdio won't compile because it chokes on math-68881.h. I'm compiling on a sun4. The problem is in __asm definitions that use floating point registers. This only happens with the -msoft-float flag. It dies with
    /sprite/src/lib/include/sun3.md/math-68881.h:364: inconsistent operand constraints in an `asm'

1596. Date: Mon, 17 Sep 1990 23:11:53 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: cc broken
The new cc doesn't work right. Every so often the name of the temporary file passed to cpp is "/tmp.cpp" instead of "/tmp/cc123456.cpp". This causes cpp to exit with:
    /sprite/cmds.sun4/cpp: /tmp.cpp: invalid argument

1597.
Date: Tue, 18 Sep 90 09:34:07 PDT From: Fred Douglis <douglis> Subject: allspice sendmail catatonic
i noticed a distinct lack of mail, and i checked on allspice. its sendmail daemon was around but not responding.

1598. Date: Tue, 18 Sep 90 11:27:09 PDT From: shirriff (Ken Shirriff) Subject: sun3 compiler bug
The following program doesn't work on the sun3 when compiled with -O:
    main()
    {
        double d;
        d = 0.;
        printf("start: d = %f\n",d);
        printf("start: d = %f\n",d);
        if (d==0) printf("zero\n");
    }
It outputs 0, then (NaN). The problem is d is stored in register fp2, which is trashed when the first printf returns. The problem is either the compiler is incorrectly assuming fp2 is preserved, or printf is incorrectly trashing fp2 (perhaps the assembly code in the math library is wrong). This problem predates my printf change and Bob's compiler change yesterday.
I've examined the sun3 bug I reported earlier and the problem is that register fp2 is getting destroyed during system calls. Anyone know why this would happen?

1599. Date: Tue, 18 Sep 1990 12:13:05 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: pmake in background
If I stop a running pmake and put it in the background I get lots of the following messages:
    *** Stopped -- signal 11
    --- sun3.md/netLE.o ---
I get many messages for the same file. Eventually the machine crashed, although I'm not sure if it is related. Also, I got a few "child not in table" messages.

1600. Date: Tue, 18 Sep 90 12:34:42 PDT From: Mike Kupfer <kupfer> Subject: "cc: not found" on sun3
When I run a make in a private test directory, I get the following:
    make loan
    cc -g -c loan.c
    cc: not found
    *** Error code 1
If I use "pmake" instead of "make", it works.

1601. Date: Tue, 18 Sep 90 16:42:44 PDT From: Mike Kupfer <kupfer> Subject: random "Error code 1" doing make on ds3100
I'm trying to rebuild the C library, using mustard.
It'll go for so long, then I'll get something like:
    --- ds3100.md/Host_ByID.go ---
    cc -O -Dds3100 -Dsprite -Uultrix -I. -Ids3100.md -I/sprite/lib/include -I/sprite/lib/include/ds3100.md -g3 -c Host_ByID.c -o ds3100.md/Host_ByID.go
    *** Error code 1
No error message, just "error code 1". If I rerun make, it gets past this one and then quits someplace later.

1602. Date: Wed, 19 Sep 90 09:13:35 PDT From: ouster (John Ousterhout) Subject: Allspice crash
When I came in this morning Allspice was wedged up printing continuous messages on its console about writebacks dropped, mostly from client 44 (our old friend mustard) but also some from 57 (clove). I couldn't get any response at all from the console, so I reset it and rebooted. By the way, when allspice rebooted piracy reopened 1335 handles, even though it had no window system running and no processes at all except login and possibly migrated processes. Does anyone have any idea why so many handles would be reopened? Could this be related to the pipe handle leak we were (are?) experiencing?
P.S. Allspice is now running 1.075, the new kernel compiled by Mendel last week.

1603. Date: Wed, 19 Sep 90 10:46:50 PDT From: shirriff (Ken Shirriff) Subject: X messed up
X has suddenly stopped accepting DISPLAY="sprite:0" and I must use "violence:0". With display "sprite:0" I get the message
    Tx quitting: couldn't find display (missing DISPLAY environment variable?).
So what's changed since yesterday?

1604. Date: Wed, 19 Sep 90 10:52:18 PDT From: gibson (Garth Gibson) Subject: Background simulations
Although pmake/mig have helped get my simulations done, they fail routinely. Last night I left a collection of simulations running on sun4s and a different collection running on ds3100s. This morning little more is done than when I left. On the sun4 side, the pmake host crashed during the night. This is the second time in 24 hours that a sun4 pmake host has crashed running my simulations.
On the ds3100 side, pmake and all its children are suspended, apparently waiting for an idle machine (there are of course lots of idle machines including the pmake host). I will restart simulations soon and check them periodically. If I constantly babysit, I have found that I can get a lot of work done. And it is still easier than directly running jobs on 10-20 machines.

1605. Date: Wed, 19 Sep 90 10:57:17 PDT From: Fred Douglis <douglis> Subject: Re: Background simulations
as i mentioned to garth in a separate note, i think the reason for pmake locking up is allspice's crash. the migration daemon was screwed up in some fashion -- e.g., this morning after allspice returned, i got an error from finger "cannot open migration database". as for the sun4s crashing, this was reported a while ago and i don't think anything has changed.

1606. Date: Wed, 19 Sep 90 12:03:30 PDT From: elm (ethan miller) Subject: process eviction
For some reason, processes don't get evicted from my sparcstation when I start to use the keyboard; I have to explicitly evict them. This has happened several times recently with Garth's background simulations.

1607. Date: Wed, 19 Sep 90 15:24:13 PDT From: Fred Douglis <douglis> Subject: race condition when evicting processes
ethan confirmed that this might be the problem.
>>>>> On Wed, 19 Sep 90 14:35:29 PDT, Fred Douglis <douglis> said:
>> a thought about eviction. when you had trouble, was it right after
>> you were idle for 30 seconds and then became active? i wonder if
>> there's a race condition, where you evict one process that moves onto
>> your machine and then the remaining processes (e.g., "Reli") move onto
>> your host after the eviction has taken place. then you have to go 30
>> seconds idle again before another eviction is attempted.
so, this is a bit of a problem. even though pmake will find out the host was reclaimed, it won't touch the remote processes until they get evicted.
another case for more integration between the migration mechanism and the load sharing policy. the kernels don't know when a process is or is not permitted to migrate onto them. they should.

1608. Date: Wed, 19 Sep 90 15:31:01 PDT From: rab (Robert A. Bruce) Subject: king
King doesn't recognize any outside hosts. When it boots, the rdate to ohm fails, and it complains that king.Berkeley.EDU is not in the hostname database. The nameserver is zworykin, and it is up. Bootp doesn't work. Ping does not work; even when king tries to ping itself it gets 100% packet loss. I updated all the commands this week, but this problem already existed before that. There doesn't seem to be anything wrong with /etc/hosts or /etc/spritehosts. Does anybody have any ideas about what could be wrong?

1609. Date: Thu, 20 Sep 90 15:43:15 PDT From: elm (ethan miller) Subject: spritemon bug
I'm running spritemon (for CPU utilization) on my sparcStation, and it dies quite often. The error message that shows up in my syslog is:
    MachPageFault: Bus error in user proc 53e31, PC = 95b5f7b4, addr = 95b5f7bc BR Reg 80
It probably won't hurt things seriously, but I like having a CPU spritemon around.

1610. Date: Thu, 20 Sep 90 15:59:32 PDT From: shirriff (Ken Shirriff) Subject: bootp foiling ds3100 boots
I looked into why my ds3100 boots were failing and the problem was in the bootp log:
    recfrom failed: stale remote file handle.
This means FS_STALE_HANDLE, FS_VERSION_MISMATCH, or FS_NOT_CACHEABLE. Any socket gurus know why this would happen? Allspice hasn't been rebooted lately. The relevant socket was opened with socket(AF_INET, SOCK_DGRAM, 0) on port 67. My suggestion is that if bootp gets a stale handle it should restart.

1611.
Date: Thu, 20 Sep 90 17:53:36 PDT From: Mike Kupfer <kupfer> Subject: problems attaching for debugging on sun4
I tried to attach Ethan's dead spritemon on Terrorism (a SPARCstation) and kept getting complaints from gdb:
    Reading symbol data from /X11/R4/src/cmds/spritemon/sun4.md/spritemon...done.
    Type "help" for a list of commands.
    (gdb) attach 0x53e4b
    Attaching program: /X11/R4/src/cmds/spritemon/sun4.md/spritemon pid 343627
    0x95b5f7b4 in ?? ()
    (gdb) where
    #0 0x95b5f7b4 in ?? ()
    Error reading memory address 0x0: invalid argument (22).
This is even after I reinstalled spritemon to make sure I had an up-to-date binary. Ideas, anyone?

1612. Date: Thu, 20 Sep 90 18:14:29 PDT From: mendel (Mendel Rosenblum) Subject: Re: problems attaching for debugging on sun4
The problem is spritemon jumped off to the address 0x95b5f7b4 which is not valid. The real backtrace looks something like:
    0x3f84 <main+3364>: call 0x13ad8 <XtMainLoop>
    0x13ae4 <XtMainLoop+12>: call 0x13af8 <XtAppMainLoop>
    0x13b08 <XtAppMainLoop+16>: call 0x1b930 <XtAppNextEvent>
    0x1b97c <XtAppNextEvent+76>: call 0x1b660 <XtRemoveInput+448>
    0x1b7a0 <XtRemoveInput+768>: jumpl o2,g0,o7
    0x7a60 <XawPanedAllowResize+1328>: call 0x7e88 <XawPanedAllowResize+2392>
    0x7f0c <XawPanedAllowResize+2524>: call 0x460d0 <bcopy>
Something like this can happen if a program overwrites its stack.

1613. Date: Fri, 21 Sep 90 12:25:55 PDT From: mgbaker (Mary Gray Baker) Subject: migrated process/floating point problem
Terrorism just crashed with the panic in MachUserAction():
    panic( "Floating point exception with bad trap code, fsr = 0x%x\n", machStatePtr->trapRegs->fsr);
This occurs when there is an impending fp exception but then no exception is found. This seems to have happened periodically with migrated processes. The process in this case was "Reli -i 100 -I 5000 -c 95 -w .1 -o 2 -n 1 -g 17 -l 2e-06 -r 2 -u 2e-05 -v 0.01389 -x 1 -d 24 -L 3 -b 0.00157603 -m 0 -M 0 ",

1614.
Date: Fri, 21 Sep 90 14:28:31 PDT From: Mike Kupfer <kupfer> Subject: mkmf presumption (whining)
mkmf thinks it can tell whether you've got a command or a library, and there's no way (or at least no documented way that I can see) to tell it which prototype Makefile to use when it guesses wrong--which seems to happen frequently.

1615. Date: Fri, 21 Sep 90 15:46:59 PDT From: Mike Kupfer <kupfer> Subject: sprintf and vsprintf return wrong thing
sprintf() and vsprintf() currently return the buffer string (that was passed in). In the ANSI world, these routines are supposed to return the number of characters that were put into the string. The fix seems simple enough, but I wonder how much user code we'd break if we did it. Is there a plan for making non-critical user-visible incompatible changes at regular times?

1616. Date: Fri, 21 Sep 90 15:53:50 PDT From: ouster (John Ousterhout) Subject: Re: sprintf and vsprintf return wrong thing
The problem is that ANSI C and BSD disagree on this. So far we've stayed with BSD. I agree that we should switch to ANSI at some point, but it would be nice to do it late, so that other people get to find and fix all the programs that depend on the old conventions.

1617. Date: Fri, 21 Sep 90 16:08:53 PDT From: Mike Kupfer <kupfer> Subject: Re: sprintf and vsprintf return wrong thing
Well, for this specific case BSD is in fact switching to "int sprintf()" (that's what's on okeeffe, monet, and arpa these days). In the general case, staying with old BSD declarations will cause increasing problems as more ANSI user code is written. We'll avoid some hassles as long as function prototypes are turned off for user code, but we might still get bit by, say, int -> void changes for signal handlers. The problem is that we're using "__STDC__" to mean "supports function prototypes". This is an incorrect use of the symbol. "__STDC__" is supposed to mean "is ANSI compliant".

1618.
Date: Fri, 21 Sep 90 15:54:22 PDT From: shirriff (Ken Shirriff) Subject: Mail got trashed
My mail file just got trashed. A few bytes got truncated from the start of Mike's mail message, so it got appended to the previous message and starts:
    ntf return wrong thing
    Date: Fri, 21 Sep 90 15:46:59 PDT
    From: Mike Kupfer <kupfer>

1619. Date: Fri, 21 Sep 90 17:46:41 PDT From: rab (Robert A. Bruce) Subject: profiling on ds3100's (whining)
The profiling startup code, /usr/lib/mcrt0.o1.31, apparently mucks around with the internals of atexit(). Since the Sprite library uses different names for atexit() internals, the procedure that writes out the profiling data never gets called. Is there any way we can get source code for the ds3100 startup code?

1620. Date: Fri, 21 Sep 90 17:52:07 PDT From: Mike Kupfer <kupfer> Subject: ld lies about who wants undefined external (sun3)
I was trying to build a kernel and was getting told
    sun3.md/dbgMain.c:1268: Undefined symbol _DbgComplain referenced from text segment
This is a lie. It's sun3.md/dbgTrap.s that wants _DbgComplain.

1621. Date: Sun, 23 Sep 90 21:08:41 PDT From: Mike Kupfer <kupfer> Subject: sage Pmeg thrashing (whining)
Sage got a serious case of the slows. Fred looked at it briefly and thought it was "Pmeg thrashing" (which somebody should explain to me some time - what's a Pmeg?). Rebooting cured the problem, but X and the printer suffered greatly until this was done.

1622. Date: Sun, 23 Sep 90 21:38:10 PDT From: Mike Kupfer <kupfer> Subject: kdbx man page SYNOPSIS is useless
It documents non-existent options and fails to document the correct options. One wonders what other parts of the man page are inaccurate.

1623. Date: Mon, 24 Sep 90 14:28:00 PDT From: Mike Kupfer <kupfer> Subject: sun3 net module broken?
I can't boot a sun3 kernel using the uninstalled sources. It appears to initialize the ethernet card and then goes into the debugger. John H. suggests that the net module may be broken. Anyone know what the scoop is?

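[Editor's sketch for the sprintf() thread, messages 1615-1617.] The two conventions discussed there can be contrasted in a few lines of C. This is only an illustration under stated assumptions: bsd_style_sprintf is a hypothetical wrapper written for this note, not a Sprite or BSD library routine, and the standard sprintf() is assumed to follow the ANSI convention of returning the number of characters written. SPRINTF_DEMO_MAIN is likewise an invented macro that merely enables the demo main().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical wrapper (not a real library routine): formats like
 * sprintf() but returns the buffer that was passed in, imitating the
 * old convention described in message 1615.
 */
char *bsd_style_sprintf(char *buf, const char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    (void) vsprintf(buf, fmt, args);    /* caller must size buf adequately */
    va_end(args);
    return buf;
}

#ifdef SPRINTF_DEMO_MAIN
int main(void)
{
    char buf[64];
    int count;
    char *result;

    /* ANSI convention: the return value is a character count. */
    count = sprintf(buf, "%d-%s", 42, "abc");
    assert(count == (int) strlen(buf));

    /* Old convention: the return value is the buffer itself. */
    result = bsd_style_sprintf(buf, "%d-%s", 42, "abc");
    assert(result == buf);

    printf("%d %s\n", count, result);
    return 0;
}
#endif
```

Code written against the old convention, for example passing the result of sprintf() straight to another string routine, is the kind of caller that would break after a switch to the ANSI return value.
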
1624. Date: Mon, 24 Sep 90 15:53:24 PDT From: Mike Kupfer <kupfer> Subject: ds3100 booting weirdness
Why is it that when I boot a ds3100, sometimes I have to say "ds3100.md/foo" and other times I have to say just "foo"? (I notice that the bootplog doesn't show the fact that I booted mustard a couple times over the weekend. Are there multiple bootp's running around or something?)

1625. Date: Mon, 24 Sep 90 16:58:12 PDT From: shirriff (Ken Shirriff) Subject: Re: ds3100 booting weirdness
The easy solution for the dec booting sequence is to always type "init" before booting. The complete answer is that if the decstation doesn't know who is the tftp server, "boot -f tftp()foo" will work, but if it does know who is the server, "ds3100.md/foo" is necessary. The decstation knows who is the server if it has started booting something already and hasn't had "init" typed. To see what it thinks is the server, type "printenv" at the prom. There's a variable it defines with the address of the server if it knows who it is.

1626. Date: Tue, 25 Sep 90 16:45:40 PDT From: Mike Kupfer <kupfer> Subject: fgrep loses on metacharacters
    echo "fooah" | fgrep foo.h
returns
    fooah
when it should return nothing. (Even if fgrep and grep share the same implementation, the fgrep interface should be different, so that one doesn't have to screw around with escaping metacharacters.)

1627. Date: Tue, 25 Sep 90 16:48:51 PDT From: shirriff (Ken Shirriff) Subject: Re: fgrep loses on metacharacters
I just made fgrep a symbolic link to grep, since various shar archives expected fgrep to exist. Thus, as the man page says: fgrep is an alias for grep.

1628. Date: Wed, 26 Sep 90 11:11:27 PDT From: mendel (Mendel Rosenblum) Subject: Device #3 kills Sprite
I made the mistake of creating a device with the major number of 3 on the Sprite cluster in cory. Stat'ing this device causes the kernel to jump to location 0 on both the ds3100 and the sun4.
The problem is due to some garbage left over from an aborted attempt to stuff the ipServer into the kernel. I've corrected this problem in my copy of the file system. Until this gets installed, avoid typing ls or stat'ing any files in /dev/ over in cory.

1629. Date: Wed, 26 Sep 90 13:19:25 PDT From: mendel (Mendel Rosenblum) Subject: Sprite in Cory problems
Sprite in Cory really sucks. Here are some of the problems:
1) Sprite RPC over INET routes didn't work to evans because /etc/spritehosts has the wrong ethernet address for the gateway machine. This wrong address also broke the ipServer routing. I've fixed this.
2) Sprite RPC over INET routes doesn't work on machines with the wrong byte order such as ds3100. The problem here was the Host_* library was changed to return the inet address of a host in host byte order rather than network byte order. Netroute (which uses the Host_* library) wasn't changed. I've fixed and installed a new netroute.
3) Kernel resident inet address of a machine doesn't get set explicitly by the Sprite boot scripts. This doesn't cause problems in evans because the machines RARP and get the response from SunOS machines. There are no machines that reply to the RARP in cory. This causes all the INET routes to not work. I added an explicit "netroute -s" command to bootcmds in cory.
4) The ipServer doesn't work correctly because there is no gateway host to bounce packets off. This means it can't talk to any local net hosts not in the /etc/spritehosts. Putting local hosts in /etc/spritehosts is a bad idea because Sprite starts to ARP/RARP for them. This requires adding ARP to the ipServer.
5) X servers don't work over there. X11 R4 is not present and X11 R3 doesn't appear to work.
6) There is no Sun machine running Sprite in Cory other than raid2. This means we have to debug raid2 from evans.

1630. Date: Wed, 26 Sep 90 15:42:12 PDT From: Mike Kupfer <kupfer> Subject: Eng.
Manual: imported programs with multiple targets
The Engineering Manual (section 4.4) should say something about imported code with multiple targets (e.g., RCS). Without looking at existing examples, one might think that the correct (or at least, a workable) way to install the code is
    /sprite/src/cmds/rcs/{rcs,ci,co,rcsdiff}
when in fact the correct way (and apparently the only way that mkmf understands) is
    /sprite/src/cmds/{rcs,ci,co,rcsdiff}
(with the usual symbolic links in "ci" et al.)

1631. Date: Thu, 27 Sep 90 10:35:34 PDT From: mendel (Mendel Rosenblum) Subject: Recovery reopens deleted files
The bug is that clients do recovery on deleted files. The recovery system seems to recover every handle that a client has. Because the system doesn't explicitly free handles of files that are deleted, clients reopen deleted files. This seems really silly to me. It's also very wasteful. It causes more RPCs at recovery time and loads the server down doing bogus reopens. If the file number of the deleted file has been reused, the client's handle for the deleted file will be transformed into a handle for this new file, wasting space on both the server and client. Mary said she would fix it when she upgrades the recovery system.

1632. Date: Thu, 27 Sep 1990 14:23:59 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: restore broken
The last two times I've tried to restore something I've gotten the following error:
    restore -f /hosts/allspice/dev/exabyte -v /sprite/src/kernel/mach.jhh
    opening /hosts/allspice/dev/exabyte as archive file
    rewinding tape ... done rewinding tape.
    reading tape label
    rewinding tape ... done rewinding tape.
    Using tape #23
    TapeLabel=|SPRITE DUMP TAPE #23
    023 01 0 463316509 Tue Sep 25 02:44:13 1990 /sprite/src
    023 02 0 546856002 Tue Sep 25 03:52:28 1990 /sprite/src/kernel
    023 03 0 209930669 Tue Sep 25 05:02:50 1990 /c
    023 04 0 253712785 Tue Sep 25 06:26:47 1990 /b
    023 05 0 256980001 Tue Sep 25 07:55:22 1990 /X11
    023 06 0 389326483 Tue Sep 25 08:47:44 1990 /scratch3
    |
    Using file #2
    skipping 2 files
    rewinding tape ... done rewinding tape.
    successfully skipped 2 files
    successfully forked tar
    tar.gnu: Hmm, this doesn't look like a tar archive.
    tar.gnu: Skipping to next file header...

1633. Date: Thu, 27 Sep 90 15:31:46 PDT From: joel (Joel A. Fine) Subject: Possible sprite/ultrix incompatibility
I am trying to run a couple of window-based programs that I copied over from a decstation running Ultrix to one running Sprite. At first, I got the following error:
    getsvc: stat of /etc/svc.conf failed
    getsvc: stat failed: No such file or directory
    X Toolkit Error: Can't Open display
There was no file called /etc/svc.conf on the Sprite machine, so I copied that file from the Ultrix machine (I hope that doesn't cause any weird side effects) and am now getting Segmentation Violations, with the following message going to my syslog:
    Bad user TLB fault in process 4838: pc=554554 adr=0
The programs are in /r1/joel. Anything starting with dx (dxcalc, dxcalendar, etc.) in this directory exhibits this problem. Does anyone have any ideas on why this happens, and what can be done to correct it? I don't think the source code is available for these programs, so it may be tough to figure out.

1634. Date: Fri, 28 Sep 90 10:33:00 PDT From: ouster (John Ousterhout) Subject: Mail problems
Mail doesn't seem to be getting into Sprite, and "mailq" shows a bunch of jobs apparently stuck in the outgoing mail queue. Does anyone (besides our dear departed Fred) know how to fix these problems? I've tried restarting sendmail on Allspice but that doesn't seem to have fixed either problem.

1635.
Date: Fri, 28 Sep 90 12:18:29 PDT From: shirriff (Ken Shirriff) Subject: Mail problems
I removed the lock files in /sprite/lib/mqueue and reran sendmail (with sendmail -q). This sent out mail stuck on sprite. Sendmail couldn't communicate with cory for some reason. I tried to do this on ginger, but got:
    Connecting to allspice.berkeley.edu via tcpld...
    Trying 128.32.150.27... Connection timed out during user open with allspice.berkeley.edu
    bmiller@sprite.Berkeley.EDU... Deferred: Host allspice.berkeley.edu is down
so there seems to be something wrong between ginger and allspice.

1636. Date: Fri, 28 Sep 90 13:03:49 PDT From: ouster (John Ousterhout) Subject: Re: Mail problems
How about restarting all the daemons on Allspice to see if this fixes the problem? If this doesn't work, then I think we should reboot Allspice. Both Randy and I are expecting important mail, so the problem needs to be fixed real soon (in the next hour or two).

1637. Date: Fri, 28 Sep 90 13:42:54 PDT From: Mike Kupfer <kupfer> Subject: Re: Mail problems
I killed off the ipServer (and inetd) on allspice, ran /hosts/allspice/restartservers (and had to manually re-restart sendmail). Sendmail seems to be back in order, but I don't know how to tell ginger's sendmail to process its queue now (instead of waiting for the timeout to expire).

1638. Date: Sun, 30 Sep 90 09:59:37 PDT From: ouster (John Ousterhout) Subject: /sprite/lib/sendmail/aliases not handled right
I noticed this morning that /sprite/lib/sendmail/aliases is writable by root and has been modified, contrary to the instructions placed at the beginning of the file. The modifications are in the "sprite-users" alias: the checked-in version has "tandrews" as part of "sprite-users", while the modified (but not checked out) version doesn't. Could it be that the script to remove a user is not handling the aliases file correctly?
P.S. By the way, I've checked in the change.

1639.
Date: Mon, 1 Oct 90 12:57:38 PDT From: mendel (Mendel Rosenblum) Subject: swap server recovery deadlock
While testing LFS, I found the following deadlock that spans the proc, vm, and recov modules: A shell was trying to exec a df command and took a page fault while trying to copy the exec arguments to the stack. The page-in paused in DoPageAllocate because the swap server was down. The stack looked like:
    #0 0xf600c6f0 in Mach_ContextSwitch ()
    #1 0xf60a3ff8 in SyncEventWaitInt (...) (...)
    #2 0xf60a2a40 in Sync_SlowWait (...) (...)
    #3 0xf60b269c in DoPageAllocate (...) (...)
    #4 0xf60b2790 in VmPageAllocate (...) (...)
    #5 0xf60b3290 in Vm_PageIn (...) (...)
    #6 0xf600e5d4 in MachPageFault (...) (...)
    #7 0xf6011320 in Vm_CopyOut ()
    #8 0xf60814d8 in DoExec (...) (...)
    #9 0xf60807d0 in Proc_Exec (...) (...)
    #10 0xf608063c in Proc_ExecEnv (...) (...)
Note that the PCB of a process is locked during the exec. When the swap server rebooted, the recovery module calls Fsutil_Reopen() which executes the following code:
    /*
     * Kick all processes in case any are blocking on I/O
     */
    Proc_WakeupAllProcesses();
    /*
     * Tell VM that we have recovered in case this was the swap server.
     */
    Vm_Recovery();
Proc_WakeupAllProcesses() blocks because the PCB of the df process is locked. The df process is not continued until Vm_Recovery() is called. The swap server is marked as back up by the client.

1640. Date: Mon, 1 Oct 1990 21:50:41 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: utils include files
The utils module has a number of public header files, such as hash.h, trace.h and now bf.h. None of these are installed by "pmake installhdrs" because mkmf thinks they are private because their name doesn't start with "utils" as per the Sprite coding conventions.

1641.
Date: Tue, 2 Oct 90 10:34:28 PDT From: ouster (John Ousterhout) Subject: Re: utils include files
I believe that there is a special mkmf variable you can set to declare header files public even if their names don't fit the normal patterns for public header files. Check the mkmf documentation (or the .mk files) for details. Perhaps it's called "PUBHDRS"? I don't remember whether (a) you set it in local.mk before including SYSMAKEFILE, or (b) whether you add to it ("PUBHDRS += ...") after including SYSMAKEFILE.

1642. Date: Tue, 2 Oct 90 12:16:55 PDT From: mendel (Mendel Rosenblum) Subject: /X11.old prefix change breaks sprite
Somehow, the prefix tables on some of the machines that have been up since before the prefix change on allspice are incorrect. For example from treason:
    treason% prefix
    Prefix               Server    Domain  File #  Version
    /                    allspice      10       2        1  imported
    /swap1               allspice       0       2        1  imported
    /user3               (none)        -1      -1       -1  imported
    /user1               allspice       2       2        1  imported
    /X11                 allspice       9       2        1  imported
    /user2               assault        0       2        1  imported
    /c                   oregano        3       2        1  imported
    /sprite/src          allspice       7       2        1  imported
    /sprite/src/kernel   allspice       6       2        1  imported
    /user4               assault        9       2        1  imported
    /mic                 allspice       3       2        1  imported
    /sprite/spool/msgs   oregano      776    3499        0  imported
    /b                   oregano        4       2        1  imported
    /local               allspice       8       2        1  imported
    /X11.old             allspice       9       2        1  imported
Note that /X11 and /X11.old are the same prefix (<allspice,9,2>). Combined with migration, this can cause machines to crash.

1643. Date: Tue, 02 Oct 90 12:34:38 PDT From: Mike Kupfer <kupfer> Subject: chgrp as root failed
I found a source tree that was group "wheel" instead of "sprite". I su'd to root and did
    sage-4# chgrp -R sprite .
and I got back
    chgrp: You are not the owner of sh.func.c
    chgrp: You are not the owner of sh.func.c,v
If I'm root, why should chgrp care?

1644.
Date: Tue, 2 Oct 90 13:18:00 PDT From: ouster (John Ousterhout) Subject: Re: /X11.old prefix change breaks sprite
I suspect that the problem is that Bob changed the name of a domain without changing its partition, and that clients can then get two prefixes with different names in their prefix tables. This seems like a bug, but I suspect that it may not be easy to fix.

1645. Date: Tue, 2 Oct 1990 17:05:22 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: rcsmerge
Rcsmerge sometimes screws up and fills the screen with '?' characters. I think this is because 'ed' is being passed a bogus command script, but I can't figure out why. There is little chance of this being fixed, but I just thought I'd report it.

1646. Date: Wed, 3 Oct 90 15:07:40 PDT From: ouster (John Ousterhout) Subject: Error code 16
When I run pmake on SPARCstations I'm getting fairly frequent "*** Error code 16" messages, which abort the compilation even though there are no compiler errors. Is anyone else getting these? I don't suppose they could be related to the prefix change? I'm not compiling in an area whose prefix has changed.

1647. Date: Wed, 3 Oct 90 15:14:08 PDT From: mendel (Mendel Rosenblum) Subject: Re: Error code 16
I've gotten this error 16 too. About the same time I got a syslog message of:
    SigMigSend: process 4126e no longer migrated.

1648. Date: Thu, 4 Oct 90 09:26:42 PDT From: ouster (John Ousterhout) Subject: Floating-point crash
Mercenary crashed last night with the following error message:
    Floating point exception with bad trap code, fsr = 0x68ba0
Is this the same floating-point error that we discussed at the Sprite meeting this week?

1649. Date: Thu, 4 Oct 90 11:17:49 PDT From: mendel (Mendel Rosenblum) Subject: Re: Floating-point crash
This is more a migration problem than a floating-point problem but after a week-long paid vacation to Hawaii I might be inclined to take a look at it.
The problem is caused by a combination of floating point exceptions, timer interrupts, and migration. Mike's program took a timer interrupt that signaled the end of its quantum while the FPU was crunching on a floating point add with operands that the hardware didn't like. The quantum expiration caused the process to be context switched, with the context switch code noting the floating point exception and trap code. While it was context switched, someone (probably pmake) tried to migrate the process. The code in Mach_EncapState() resaves the floating point state from the FPU in the trap regs. This is done because it can't tell if the FPU state was dumped or not. Unfortunately it overwrites the pending FPU trap code. When the process is restarted on the new system the kernel panics because the pending exception flag is set and the trap code says no trap. This error can only happen when a trap that causes a context switch happens with a floating point exception pending in the FPU. I've fixed Mach_EncapState() not to overwrite the trap code.

1650. Date: Thu, 4 Oct 90 19:04:12 PDT From: dedood (Paul de Dood) Subject: printer queue consistently jams
I seem to have the amazing ability to jam printer queues on -Pps. I'm running dvips sending it straight to the ps printer. Could you please look into this, as this is the only PostScript printer I have access to.

1651. Date: Thu, 4 Oct 90 19:23:54 PDT From: dedood (Paul de Dood) Subject: more printer queue problems
It seems that I mess up the printer queue with a lpr -d -Pdp, so I doubt the problem is printer specific.

1652. Date: Fri, 5 Oct 90 13:27:22 PDT From: elm (ethan miller) Subject: raid1 crashed
Here's what it printed out; the kernel it was running was no longer around so it couldn't be debugged.
    FindComponent, no handle <0x4001b> for ".."
fileNumber 127203
Fatal Error: handleRelease, handle <1,77,1,127204> "sun" not locked
Entering debugger with a interrupt trap (16) exception at pc 0xf60933d4
The error occurred just after I put a lot of files (~80MB) onto raid1, had them copied to a Unix machine in cory, and deleted them. "sun" was the name of a subdirectory in /r1/tmc/bin (I did a /bin/rm -fr tmc after the copy to giverny [Ken Lutz's machine] was done). We rebooted it at 13:25 on 10/5/90.

1653. Date: Fri, 5 Oct 90 16:10:23 PDT From: shirriff (Ken Shirriff) Subject: Allspice crashed
Allspice crashed, apparently with a deadlock. The symptoms were that it was going through repeated recovery with subversion and its load average was around 5. Then everything wedged up. Unfortunately, my L1-i information function crashed allspice, and then shallot was down, so I couldn't debug it.

1654. Date: Mon, 08 Oct 90 12:15:34 PDT From: rab (Robert A. Bruce) Subject: xinit on chisum
X runs fine on king, but when I try to run it on chisum, I get
    chisum> xinit
    giving up.
    xinit: connection refused (errno 61): unable to connect to X server
    chisum>
-bob

1655. Date: Mon, 8 Oct 90 12:22:34 PDT From: seth (Seth J. Teller) Subject: gremlin can't draw small, thick circles
the latest sprite version of gremlin under X cannot draw small (~2-3 pixel) diameter circles using the thickest line style. also: gremlin cannot draw filled circles of any size. i'm not sure if this is a bug or a feature.

1656. Date: Mon, 08 Oct 90 15:20:57 PDT From: rab (Robert A. Bruce) Subject: Window Underflow
Sabotage just got a watchdog reset because of a window underflow.

1657. Date: Mon, 8 Oct 90 15:38:56 PDT From: bmiller (Bob Miller) Subject: msgs???
Did something happen to msgs? For the last week or so, all I get back is "No new messages," which is kinda hard to believe.

1658. Date: Tue, 9 Oct 90 10:53:42 PDT From: ouster (John Ousterhout) Subject: Sendmail dead?
Sendmail seems to be dead on Allspice again.
Can someone in Evans hall restart Allspice's daemons? This problem is happening too frequently for my taste. Perhaps it's time to start thinking about how to make this stuff more reliable.

1659. Date: Tue, 9 Oct 90 13:59:26 PDT From: gibson@apathy.Berkeley.EDU (Garth Gibson) Subject: Re: excessive migration usage
> From dedood@sprite.Berkeley.EDU Tue Oct 9 12:44:19 1990
> Date: Tue, 9 Oct 90 12:44:42 PDT
> To: gibson@sprite.Berkeley.EDU
> Subject: excessive migration usage
>
> Last night and this morning I have been unable to migrate any of my
> processes because you have fully loaded every sun4 and ds3100 on the
> sprite network. This seriously reduces my productivity.
> Is it necessary for you to use every machine for days, or is it possible
> for you to leave some machines available for the rest of us?
>
> Thanks,
> Paul.

My jobs are installed at the "background" level. They are not supposed to be able to stop compilation migration. Each of my tasks requires a small VM space so it should not exhaust swap space or induce thrashing. The goal of Fred Douglis' system was to make all idle cycles available to background tasks while not interfering with interactive work. While it is true that Fred did not introduce full "nice"-like scheduling in the sprite system (so my jobs do get CPU time slices when a compile migrates on top of them), he claims that I should not decrease the parallelism available to compiles.

There are a couple of known bugs:
1) after a series of remigrations because workstations come into use by their primary user, up to 15 jobs can end up running on the host processor when they should be suspended by pmake,
2) pmake fails to notice some machines going idle after a series of remigrations,
3) if my pmake host processor is also being used by a busy human directly in front of it, the process table sometimes overflows and it can be difficult to get it out of this state (Mendel has used remote kernel debugging to kill off a few processes).
Because of the first and second bugs I check on my tasks every once in a while and use "mig -B -h .... -p ...." to spread clustered jobs to machines in the "avail" state reported by rup. These manual remigrations should also be at the "background" level. Because of the third bug I try to use, as hosts, machines that are less frequently used directly by others. My favorites are vagrancy and saffron because they are in my office. On the weekend saffron was down so I used sassafras for a sun4 host. This machine was suggested by Fred.

The way background pmake works is based on keeping all idle machines busy at all times. It is not easy for me to keep a few idle at all times without using a very small subset at all times. If you are using rsh to execute on a machine not directly in front of you, then I believe Fred's scheme for determining "avail" status will not notice you unless you keep the load average over 1.0.

It may be the case that Fred's system has more bugs than I know about. In that case my heavy use and your exasperation are exactly the debugging that Sprite relies on. Bugs in this system are sensitive to network-wide activity so they are difficult to isolate and recreate. As John has said to me about this before, "bang away".

In case you think me unrepentant, it is clear that my tasks should not be allowed to interfere with your productivity. While I would not like my jobs killed arbitrarily, if you do so please inform me. I don't think that would solve your problem because pmake would migrate another of my tasks onto your machine. A simpler solution is to kill -STOP my job; pmake will think it busy and should leave the machine alone (unless the machine is "avail" and I manually migrate to it). On the overkill side, you can disallow all migration to a particular machine with "migcmd -I none" (I have never used this, but I'm told it works).
The best solution would be for background migration to work correctly, but that probably means that it will have to run broken until Sprite gurus have the time to debug it.

1660. Date: Tue, 09 Oct 90 16:45:23 PDT From: Mike Kupfer <kupfer> Subject: paranoia check for Vm_Cmd (whining)
It would be nice if Vm_Cmd would check the validity of its arguments. In particular, I made the mistake of doing "vmcmd -n 0", which brought down sage when it tried to do a division by 0.

1661. Date: Wed, 10 Oct 90 13:48:51 MET From: douglis@cs.vu.nl Subject: inetd needs a kick
finger @sprite is getting "connection refused"....

1662. Date: Wed, 10 Oct 90 11:51:07 PDT From: ouster (John Ousterhout) Subject: Finger dead
I killed and restarted inetd on Allspice; this seems to have brought "finger @allspice" back to life again.

1663. Date: Wed, 10 Oct 90 11:56:12 PDT From: ouster (John Ousterhout) Subject: Bug in new installroute?
When John Wawrzynek rebooted his DS3100 this morning with the "new" kernel, he got a zillion messages about Net_InstallRoute failing with a bad system call argument, or something like that. The machine seems to work fine, but the error messages are a bit worrisome. Could this be related to the new version of installroute?

1664. Date: Wed, 10 Oct 90 12:24:13 PDT From: dedood (Paul de Dood) Subject: mail
When I got in this morning, my mailbox showed that I had mail. However, when I tried to enter "mail", I got the following message:
    Warning: encountered nulls at 5. Mail spool file may be damaged.
    No mail for dedood

1665. Date: Wed, 10 Oct 90 12:31:20 PDT From: shirriff (Ken Shirriff) Subject: Re: mail
/usr/spool/mail/dedood contained 976 bytes of random garbage. This is probably from a 976-byte mail message sent from a machine just before it crashed, before the data got written. Bob, can you restore the previous /usr/spool/mail/dedood from tape, if it contained anything?

1666. Date: Thu, 11 Oct 90 10:59:29 PDT From: rab (Robert A.
Bruce) Subject: dump failed
The daily dump failed last night. Only /user1 was dumped successfully. It was not a hardware problem. It looks like allspice was running slow. /user1 took over 4 hours instead of the expected 10 or 20 minutes. The next filesystem was still in progress when allspice was rebooted this morning.

1667. Date: Thu, 11 Oct 90 11:00:56 PDT From: ouster (John Ousterhout) Subject: Allspice crash
Allspice was down when I came in this morning. It was printing continuous messages of the following form:
    client 52 dropped 30 write-back & invalidate requests for "johnw" <10,2237>
The console was lifeless: <BREAK>-commands didn't do anything, and when I attempted to run commands I got the message "no more processes". At this point I rebooted Allspice. 100 minutes later, the system finally got rebooted. Here is a partial list of some of the problems that occurred:

1. Rebooting from Ginger was slow: about 15 minutes to get the first kernel image over. I think this may have been due to clients bashing on Ginger and confusing the TFTP protocol (many or all of the clients were spewing continuous messages about interaction problems with Allspice during the reboot). Can someone move the current kernel to the place on disk from which it can be booted, and change the sign on Allspice's console?

2. Allspice rebooted continuously in a never-ending loop. There were several reasons for this, detailed below.

3. /local did not have a "lost+found" directory, causing fscheck to abort on it. Shouldn't fscheck create lost+found automatically? Mendel and I tried to create lost+found manually, but found that i-number 3 is already in use by the symbolic link /local/cmds. Does lost+found need to be i-number 3? We thought it did, and also thought cmds was a directory, so we mv-ed it to lost+found. However, this didn't work because cmds was really a symbolic link. We eventually fast-booted Allspice without fixing this problem. It needs to be fixed soon.
I've temporarily commented out /local's line in /hosts/allspice/mount, so that Allspice will (hopefully) be able to boot if it should crash before the problem is fixed. This means that /local won't be available after future reboots. How did /local get created without a /lost+found directory? This sounds like a bug in the software that makes new filesystems.

4. The booting scripts somehow got confused into thinking /local was the root partition (could it have anything to do with the fact that /local is on rsd10.0c?), so the above errors in /local caused Allspice to reboot continuously.

1668. Date: Thu, 11 Oct 90 09:47:26 PDT From: gibson@apathy.Berkeley.EDU (Garth Gibson) Subject: floating point printf
In the last few weeks it looks like printf "%g" has changed again. I'm not sure which machine type because my simulations were run on both, but now NaN shows up as "(NaN)" - this is fine - but a number that until recently printed fine is coming out "-.,0.,(e+15"

1669. Date: Thu, 11 Oct 1990 12:04:35 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: boot sequence broken
The problems with allspice rebooting continuously are related to the new boot sequence. The kernel attempts to attach a list of default disks as the root "/", then detaches them if they aren't (based upon a few rules about the root disk). This means that those disks that are rejected end up with the name "/" in the summary sector, which later on fools fsattach into thinking they are the root disk, causing a reboot if fscheck fails on them. I think the best solution would be for the kernel to not modify the summary sector (i.e., put back the old prefix when detaching), rather than storing the wrong prefix. I don't know how feasible this is. The other alternative is to have fsattach determine the root disk in a different manner. If we use this method fscheck will still print out the wrong name for the domain when it is checking the disk (e.g., "/" instead of "/local").

1670.
Date: Thu, 11 Oct 90 14:28:13 PDT From: ouster (John Ousterhout) Subject: Migration bugs?
The message below is from Paul de Dood about problems he had getting machines for Pmake when Garth was running simulations. I wonder if perhaps migd was seeing high loads on the machines (because of Garth's simulations) and declaring the machines to be "in use"?

>From dedood Thu Oct 11 14:06:54 1990
Date: Thu, 11 Oct 90 14:06:54 PDT
From: dedood (Paul de Dood)
To: ouster
Subject: Re: excessive migration usage

I tried two different types of migration: pmake & mig -bv. In the case of pmake, it was only executing a single compile at a time, even though there were many files which could have been compiled simultaneously, so I believe no processes were migrating. In the case of mig -bv .... it said that no idle hosts were available. A rup revealed that every sun4 & ds3100 either refused or was in use. Each machine had a load average above 1; the only significant jobs being run on the machines were all owned by gibson (about 4 processes/machine). It was 1 AM, so I know that the machines were not all being used (one was in my sight, so I know it wasn't being used).

1671. Date: Thu, 11 Oct 90 14:30:22 PDT From: Mike Kupfer <kupfer> Subject: rpc lint
    rpcCltStat.c:308: warning: `SpecialStat' defined but not used
    rpcSrvStat.c:286: warning: `SpecialSrvStat' defined but not used
Both of these functions are static. Are they kept around for reference? Should they be ifdef'd out, or deleted, or...?

1672. Date: Thu, 11 Oct 90 15:32:18 PDT From: Mike Kupfer <kupfer> Subject: lint in sun3 SCSI code
The table devCntrlr in sun3.md/devConfig.c yields the following complaint from gcc:
    sun3.md/devConfig.c:53: warning: initialization between incompatible pointer types
Each entry in the table includes an interrupt routine.
The table thinks the interrupt routine looks like
    Boolean (*intrProc) _ARGS_((ClientData clientData));
However, the definition of DevSCSI0Intr is
    Boolean DevSCSI0Intr _ARGS_ ((ClientData clientData, List_Links *newRequestPtr));
Note the second argument. I don't understand how all these fun device tables hang together, so I'd rather that somebody else fix this...

1673. Date: Thu, 11 Oct 90 16:02:00 PDT From: Mike Kupfer <kupfer> Subject: interrupt handler lint
Do interrupt handlers return ints? The sun3 and sun4 versions of Mach_SetHandler think they return ints. At least one handler (DevZ8530Interrupt, for the sun3) is declared as a void. From looking at sun[34].md/machIntr.s, I would say that "void" is correct. Anyone know the right answer?

1674. Date: Thu, 11 Oct 90 16:07:34 PDT From: mendel (Mendel Rosenblum) Subject: Re: interrupt handler lint
>Do interrupt handlers return ints? The sun3 and sun4 versions of
>Mach_SetHandler think they return ints. At least one handler
>(DevZ8530Interrupt, for the sun3) is declared as a void. From looking
>at sun[34].md/machIntr.s, I would say that "void" is correct. Anyone
>know the right answer?

They should return Boolean: TRUE if the interrupt handler detected that an interrupt condition was present, FALSE otherwise. This is helpful when you have many devices at the same interrupt priority and no interrupt vectoring.

1675. Date: Thu, 11 Oct 90 17:26:25 PDT From: mendel (Mendel Rosenblum) Subject: OFS file system writeback is braindead
The original Sprite file system cache writeback scheme has serious performance problems. A process is used per file being written to disk. These writeback processes end up writing the files a block at a time in a round-robin fashion. Because the files themselves are placed randomly on disk, every disk write requires a random seek. If we limited the system to one writeback process we would probably have a much higher write rate.

1676.
Date: Thu, 11 Oct 90 17:38:10 PDT From: ouster (John Ousterhout) Subject: Re: OFS file system writeback is braindead
I believe that Brent brain-killed the file system writeback in an attempt to make it fairer, so that no one huge file could monopolize the disk controller.

1677. Date: Thu, 11 Oct 90 18:20:18 PDT From: shirriff (Ken Shirriff) Subject: Re: floating point printf
There seem to be several things causing Garth's problems (on sun4c).
a) printf doesn't handle denormalized numbers, and gives garbage, because...
b) modf doesn't handle denormalized numbers, and gives garbage.
Garth gets denormalized numbers because
c) adding 0 and NaN doesn't work. In Garth's case, it creates a denormalized number:
    0000000... + 7ffffff... -> 00000000ffffffff
In my test case, 0 + NaN -> 0

    main()
    {
        double a,b;
        a = 0;
        b = a/a;
        a += b;
        printf("a = %f, b = %f\n",a,b);
    }

Output: a = 0.000000, b = (NaN)

1678. Date: Thu, 11 Oct 90 18:32:20 PDT From: gibson@apathy.Berkeley.EDU (Garth Gibson) Subject: Re: floating point printf
In addition to Ken's comments, the sun hardware can be made to do the right thing because SunOS seems to have no problems with my simulations.

1679. Date: Thu, 11 Oct 90 18:34:51 PDT From: mendel (Mendel Rosenblum) Subject: Re: floating point printf
>In addition to Ken's comments, the sun hardware can be made to
>do the right thing because SunOS seems to have no problems with
>my simulations.

This is false. The "sun hardware" doesn't handle NaNs. It spits them out to the software.

1680. Date: Thu, 11 Oct 90 18:48:11 PDT From: rab (Robert A. Bruce) Subject: Re: floating point printf
Once you get a NaN, I think you keep it. There is nothing you can do to get rid of it. Adding NaN to 0 should yield a NaN. So your example should print ``a = NaN, b = NaN''.
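[Editorial sketch, not part of the original thread.] The rule Bob states here, that adding anything to a NaN should still give a NaN, can be checked with a few lines of C. This is only a sketch under stated assumptions: a C99 compiler and IEEE 754 arithmetic, both of which postdate or go beyond what the thread's 1990 toolchains guaranteed. The helper name nan_survives_add is hypothetical; isnan is the standard C99 macro from <math.h>.

```c
#include <math.h>

/*
 * Hypothetical helper (not from the Sprite sources): returns 1 if
 * x + NaN is still NaN, which is what IEEE 754 arithmetic requires.
 */
static int nan_survives_add(double x)
{
    volatile double zero = 0.0;  /* volatile: keep 0/0 from being folded at compile time */
    double b = zero / zero;      /* 0/0 yields a quiet NaN on IEEE hardware */
    double a = x + b;            /* same shape as "a += b" in Ken's test case */

    return isnan(a) && isnan(b);
}
```

On a system that follows the rule, nan_survives_add(0.0) returns 1. A return of 0 would correspond to the behaviour reported for the sun4 family in this thread, where 0 + NaN came out as 0.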
I tried running on a few different systems:

    sparcStation, sunOS,  cc:   a = NaN,      b = NaN
    sparcStation, sunOS,  gcc:  a = NaN,      b = NaN
    sun3,         sprite, gcc:  a = (NaN),    b = (NaN)
    sparcStation, sprite, gcc:  a = 0.000000, b = (NaN)
    sun4,         sprite, gcc:  a = 0.000000, b = 0.000000
    ds3100,       sprite, cc:   a = (NaN),    b = (NaN)
    ds3100,       sprite, gcc:  a = (NaN),    b = (NaN)
    ds3100,       ultrix, cc:   a = NaN,      b = NaN

It looks to me like there is a problem with the sun4s. It is especially weird that old sun4's give different answers than sparcStations. (I also think we should drop the parens around NaN, since neither ultrix nor sunOS has them.)

1681. Date: Thu, 11 Oct 90 18:54:50 PDT From: Mike Kupfer <kupfer> Subject: Re: floating point printf
Yes, NaN + x = NaN for all x. According to the 68881 book, the 68881 distinguishes between "signalling" NaNs (which generate an exception when used) and non-signalling NaNs (which don't). Perhaps we're not doing the trap handling correctly for signalling NaNs?

1682. Date: Thu, 11 Oct 90 23:52:30 PDT From: shirriff (Ken Shirriff) Subject: Re: floating point printf
I looked in the sun4 floating point simulation code (mach/sun4.md/addsub.c) and it has code to handle both signalling and non-signalling NaN's, with comments explicitly saying "NaN + x -> Nan". So it should be doing the right thing.

1683. Date: Fri, 12 Oct 90 10:49:32 +0100 From: Fred Douglis <douglis@cs.vu.nl> Subject: migration load average
[the cc to the non-sprite folks is more to acknowledge the problem than because they should read this whole thing...]

it sounds like the problem with migration is due to the load being driven up for a long time. the way that the migration daemons work is that once they're allowing migration, they disable migration if the 5, 10, or 15-minute load average is above some threshold. currently those thresholds are something like 1.5, 1.75, and 2.0 respectively.
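[Editorial sketch, not part of the original message.] The rule Fred describes here amounts to hysteresis on three load averages: any one average above its high threshold disables migration, and, as the continuation of his message says, all three must fall below a lower set (.75, 1, 1.5) before it is enabled again. A minimal C99 sketch follows, assuming the approximate values he quotes; the function update_migration_allowed and the array names are hypothetical and are not taken from migd.c.

```c
#include <stdbool.h>

#define NUM_AVGS 3  /* the three load averages mentioned in the message */

/* Approximate thresholds quoted in the message ("something like ..."). */
static const double disable_above[NUM_AVGS] = { 1.5, 1.75, 2.0 };
static const double enable_below[NUM_AVGS]  = { 0.75, 1.0, 1.5 };

/*
 * Hypothetical helper: given the current "migration allowed" state and
 * the three load averages, return the new state.
 */
static bool update_migration_allowed(bool allowed, const double load[NUM_AVGS])
{
    int i;

    if (allowed) {
        /* Any single average above its high threshold disables migration. */
        for (i = 0; i < NUM_AVGS; i++) {
            if (load[i] > disable_above[i]) {
                return false;
            }
        }
        return true;
    }

    /* Once disabled, every average must drop below its low threshold. */
    for (i = 0; i < NUM_AVGS; i++) {
        if (load[i] >= enable_below[i]) {
            return false;
        }
    }
    return true;
}
```

With averages of {1.0, 0.9, 0.9}, a host that already allows migration keeps allowing it, but a host that has been disabled stays disabled because the first average has not dropped below .75. That gap between the two threshold sets is the sticking behaviour discussed in this thread.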
once migration's disabled, it's enabled only when all three averages are below some threshold (.75, 1, 1.5 right now). the idea there is that the short-term load shows that the load is definitely dropping off again. perhaps the 10 and 15 minute avgs could be ignored in that case.

the values for all these things are something i've played with off and on for quite a while. it seems to me that the high-end threshold used to be in the reverse order (as well as lower), something like 1.5, 1.25, 1.0. i raised the threshold to try to keep simulations from disabling migration, but i think i got the order wrong after all, so it's easy for garth's simulations to get the 5-min load to 1.5 (with the help of a couple of fluctuations in other processes), and then the load never drops below .75 again.

to conclude: how about if someone edits /sprite/src/daemons/migd/migd.c to have values along the lines of

    #define THRESHOLD_HIGH0 2.5
    #define THRESHOLD_HIGH1 2.25
    #define THRESHOLD_HIGH2 2.0

the idea here is that once a machine is available for migration, it would take a lot more load to keep it from being available again (bear in mind it needs a very low load to be considered for migration in the first place). this also reverses the order again, which in retrospect makes more sense to me. a spike in the load (5-min avg) should have to be pretty high to disable migration, whereas a 15-min avg of 2 means the machine is definitely pretty loaded, and probably by something other than the 1 process that migrated onto it.

ultimately, of course, the whole model needs to be generalized, and more easily parameterizable.

1684. Date: Fri, 12 Oct 90 14:29:17 +0100 From: Fred Douglis <douglis@cs.vu.nl> Subject: mailq
while checking on the state of some mail to myself from berkeley to here, i noticed a whole lot of locked files in the sendmail queue from yesterday morning (i guess when sendmail crashed). thought you might want to know.

1685.
Date: Fri, 12 Oct 90 09:35:14 PDT From: mendel (Mendel Rosenblum) Subject: Recovery problems
There are serious problems with recovery when linking with all the uninstalled modules. It seems to either
a) Go into an infinite recovery loop with allspice.
b) Get errors recovering all the program texts and swap files so every user process gets killed.
or
c) Hang waiting for recovery.
Sometimes reading just a single file causes the infinite recovery loops. More info later.

1686. Date: Fri, 12 Oct 90 10:00:53 PDT From: mendel (Mendel Rosenblum) Subject: Recov_IsHostDown
If you saw a routine named Recov_IsHostDown() and saw it used like:

    if (!Recov_IsHostDown(hdrPtr->fileID.serverID)) {
        Fsutil_Reopen(hdrPtr->fileID.serverID, (ClientData)NIL);
    }

What would you guess that it returned? Boolean? How about a ReturnStatus which is SUCCESS, FAILURE, RPC_SERVICE_DISABLED, RPC_TIMEOUT, or any one of another half-dozen error codes. This also points out a limitation of function prototypes.

    extern void foo(ReturnStatus status);
    Boolean b;
    int i;
    ReturnStatus status;
    main()
    {
        foo(b);
        foo(i);
        foo(status);
        foo(NIL);
        foo(TRUE);
        foo(strlen("t"));
    }

will compile without an error message.

1687. Date: Fri, 12 Oct 90 10:28:03 PDT From: rab (Robert A. Bruce) Subject: Re: Recov_IsHostDown
The problem is that `typedef' provides an alias for an existing type. It does not define a new type. As far as the compiler, or even lint, is concerned a `Boolean', a `ReturnStatus', and an `int' are exactly the same thing. If we want to be able to catch bugs like this, then we need to use C++, or get a better lint. There are some commercially available lints that will treat typedefs as different types, as well as doing strict checking of function prototypes and other ANSI features.

1688. Date: Fri, 12 Oct 90 10:32:37 PDT From: mendel (Mendel Rosenblum) Subject: Bug with large non-cached I/O
Large (over 16K) reads of noncached files caused repeated timeouts and recovery with the file server.
The problem was caused by changes to the RPC module that caused Rpc_MaxSizes() to indicate the system supports rpcs with 31744 bytes of data. The file system believes the RPC module and tries to do RPCs of this size. The rpc/net module appears to drop these large RPCs on the floor.

1689. Date: Fri, 12 Oct 1990 12:38:37 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: misleading constants
This can be considered whining. In various places in the kernel we have constants such as RPC_MAX_NUM_FRAGS that define the sizes of things. Quite often changing these constants causes the code to break. In the case of RPC_MAX_NUM_FRAGS one would expect that changing the maximum number of fragments in an rpc would allow larger rpcs to be sent. In reality the rpc fails somehow. I would like to suggest that if a constant must have a particular value, the appropriate initialization code should check that the constant hasn't been changed and panic if it has. Obviously there should be a comment somewhere that explains why the value cannot be changed. Otherwise we end up with the system not working for non-obvious reasons.

1690. Date: Fri, 12 Oct 90 15:43:09 PDT From: ouster (John Ousterhout) Subject: _CONST redefined?
When I compile a file that includes <math.h> I get the warning message "/sprite/lib/include/math.h:22: warning: _CONST redefined". I assume that this is a consequence of Bob's recent changes? Bob, can you avoid the use of _CONST, since it seems to conflict with math.h? Thanks.

1691. Date: Fri, 12 Oct 90 15:07:03 PDT From: Mike Kupfer <kupfer> Subject: runaway shell on allspice
I just killed a csh that was sucking up lots of allspice's cycles.
    10e0b READY 2:30 csh -i
(it was, at various times, taking 35% of the CPU). It was owned by root; is there some way to track down who might have owned it? (Under UNIX I'd check the controlling tty and see who was logged in there. Is there some Sprite equivalent?)

1692.
Date: Fri, 12 Oct 90 17:17:08 PDT From: ouster (John Ousterhout) Subject: Allspice daemons dead
Mail isn't getting through to Allspice, and the network daemons seem to be dead. Can someone on the 6th floor restart them? Thanks.
-John-

1693. Date: Fri, 12 Oct 90 17:44:08 PDT From: shirriff (Ken Shirriff) Subject: Re: floating point printf
>Bob:
>(I also think we should drop the parens around NaN, since neither
>ultrix nor sunOS has them.)

I've modified printf to print NaN and Inf, instead of (NaN) and (INFINITY). It now handles negative NaN and Inf too, as SunOS does. On the sun4, which apparently doesn't handle subnormal numbers, subnormals are printed as "Sub" instead of "-.,0.,(e+15". The new library also has my faster fread and fwrite, which as far as I know have no bugs. The sun4 floating point bug still remains, though.

1694. Date: Fri, 12 Oct 90 18:02:07 PDT From: mendel (Mendel Rosenblum) Subject: floating point on sun4 fixed
I fixed the problems reported with floating point on the sun4. The CPU was being told to skip over random instructions after the FPU trapped to the software emulation.
Mendel

1695. Date: Fri, 12 Oct 90 18:04:00 PDT From: mendel (Mendel Rosenblum) Subject: printf of subnormal numbers broken on sun4
The printf on the sun4 doesn't print subnormal numbers. It instead prints "Sub".

1696. Date: Fri, 12 Oct 1990 18:32:03 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: incomplete comments
Any ideas what the "code" and "addr" parameters might be?

    /*
     *----------------------------------------------------------------------
     *
     * Sig_SendProc --
     *
     *      Store the signal in the pending mask and store the code for the given
     *      process.
     *
     *      NOTE: Assumes that we are called without the master lock down and
     *      with the process locked.
     *
     * Results:
     *      In the case of a local process, SUCCESS is returned. If the process
     *      is migrated, error conditions such as RPC_TIMEOUT may be returned.
     *
     * Side effects:
     *      Signal pending mask and code modified. If the process being signalled
     *      is migrated, an RPC is sent. If the process is local, the sched_Mutex
     *      master lock is grabbed.
     *
     *----------------------------------------------------------------------
     */
    ReturnStatus
    Sig_SendProc(procPtr, sigNum, code, addr)
        register Proc_ControlBlock *procPtr;
        int sigNum;
        int code;
        Address addr;

1697. Date: Fri, 12 Oct 90 22:57:58 PDT From: Mike Kupfer <kupfer> Subject: keyboard bounce caused by high load?
I noticed that the load on sage was up a bit when I was having the keyboard bounce problems. The load has gone away, and so has the bounce. I suppose it could be coincidence, but then again, maybe not. (All of this refers to when I'm running X.)

1698. Date: Fri, 12 Oct 90 23:30:59 PDT From: Mike Kupfer <kupfer> Subject: fsync weirdness
Emacs is unhappy on sage. It works fine on other machhines, but on sage it compplains about an "IO error writingg" any file.. (Notice thatt the *&^^&%% keyboard bounce is back; so are TVE's pmakes.) Aftter playing around with it a bit, I've discovered that the file iss actually gettingg written back okay. What's causing the error is that fsync is failing, with errror 22 (EINVAL), which according to the man pagee means thhat the file descriptor refers to a sockket, not a file. This does not make a loot of sense to me, as the file is opened - well, creat'd - only a few dozen lines above the fsync.

The appended test program fails as well.

I suspect that this mess is somehow related to my crashing Sage while looking at the sun4 flloating point bug. I'm running Mendel's keernel; perhaps there's some recovery-related problem? Rebooting Sage doesn't helpp (though I haven't tried rebooting with an older kernel). I'll leave this up over the weekend in case anybody wants to look at it, but at some poiint (Monday?) if I still get complaints from Emacs, I''d like to try rebooting Allspice.
--
    #include <sys/file.h>

    main()
    {
        int fd = open("bar.out", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
            perror("bar.out");
            exit(1);
        }
        write(fd, "foo bar baz\n", strlen("foo bar baz\n")+1);
        if (fsync(fd) < 0)
            perror("bar.out: fsync");
    }

1699. Date: Fri, 12 Oct 90 23:43:00 PDT From: Mike Kupfer <kupfer> Subject: can't get floating point regs via gdb on sun4c
Maybe this applies to vanilla sun4s, too. "info reg" lies about the contents of the f registers. I have a test program that adds 10.0 and 20.0. It uses f2 and f4. However, "info reg" shows garbage for f2 and f4 (e.g., 7.00649e-45 for f2; 1.4013e-45 and then 0 for f4).

Wish: it would be nice if "info float" would display the floating point registers, so that you don't have to wade through all the ones that "info reg" displays.

1700. Date: Sat, 13 Oct 90 14:24:34 PDT From: mendel (Mendel Rosenblum) Subject: Re: fsync weirdness
> Emacs is unhappy on sage. It works fine on other machhines, but on
> sage it compplains about an "IO error writingg" any file.. (Notice
> thatt the *&^^&%% keyboard bounce is back; so are TVE's pmakes.)
> Aftter playing around with it a bit, I've discovered that the file
> iss actually gettingg written back okay. What's causing the error is
> that fsync is failing, with errror 22 (EINVAL), which according to the
> man pagee means thhat the file descriptor refers to a sockket, not a
> file. This does not make a loot of sense to me, as the file is opened
> - well, creat'd - only a few dozen lines above the fsync.
>I suspect that this mess is somehow related to my crashing Sage while
>looking at the sun4 flloating point bug. I'm running Mendel's
>keernel; perhaps there's some recovery-related problem? Rebooting
>Sage doesn't helpp (though I haven't tried rebooting with an older
>kernel). I'll leave this up over the weekend in case anybody wants to
>look at it, but at some poiint (Monday?) if I still get complaints
>from Emacs, I''d like to try rebooting Allspice.
I can't repeat this problem between sage and allspice. I do get the problem between machines of different byte orders. The problem here is that the Fmt_Convert() call on Ioc_WriteBackArgs in fsioFile.c is passed a format of "w" when Ioc_WriteBackArgs looks like:

    typedef struct Ioc_WriteBackArgs {
        int firstByte;       /* Index of first byte to write back */
        int lastByte;        /* Index of last byte to write back */
        Boolean shouldBlock; /* If TRUE, call blocks until write back done */
    } Ioc_WriteBackArgs;

Using the conversion of Ioc_LockArgs as a guide, I changed the format to "w3". Is this correct? From reading the man page I'd guess the format would be "{www}". This is assuming that Boolean == int == 4 byte integer. Anyone know what the correct format string is for this structure?

1701. Date: Sat, 13 Oct 90 14:32:44 PDT From: mendel (Mendel Rosenblum) Subject: Re: can't get floating point regs via gdb on sun4c
Gdb on the sun4 has no idea about the values of the floating point registers on the sun4. The problem here is that gdb was ported before the floating point was implemented on the sun4 and it hasn't been updated to "know" about the floating point.

1702. Date: Mon, 15 Oct 90 10:29:59 +0100 From: Fred Douglis <douglis@cs.vu.nl> Subject: Re: incomplete comments
this was documented in the original Sig man page(s), lost to posterity. actually, i take that back, code was documented, but addr was added recently. code was some sort of sub-code to the signal. the only place i think it may be used is by the VM system when a process is killed due to a swapping error.

1703. Date: Fri, 12 Oct 90 15:07:03 PDT From: Mike Kupfer <kupfer> Subject: runaway shell on allspice
I just killed a csh that was sucking up lots of allspice's cycles.
    10e0b READY 2:30 csh -i
(it was, at various times, taking 35% of the CPU). It was owned by root; is there some way to track down who might have owned it? (Under UNIX I'd check the controlling tty and see who was logged in there.
Is there some Sprite equivalent?) 1704. Date: Mon, 15 Oct 90 08:36:49 PDT From: ouster (John Ousterhout) Subject: Re: migration load average Fred writes: It sounds like the problem with migration is due to the load being driven up for a long time.... Ultimately, of course, the whole model needs to be generalized, and more easily parameterizable. I agree with the second statement but not the first. I think that the problem is the way migd deals with load averages. Once a process has migrated onto a machine, the load average of that machine is meaningless for the migration system, I think. The process that migrated onto the machine could spawn an arbitrary number of additional processes, so there's no upper limit on how high you might expect the load to go. If migration is to support multiple levels of migration (e.g. background simulations and foreground compilations), then it should ignore the load average when deciding whether to migrate foreground stuff onto a machine that already has background stuff. In my opinion, the only good use of load average is in deciding whether the machine should be available for migration in the first place. The load average should only be considered when the machine hasn't had migrated processes recently (i.e. the load average isn't determined by migrated processes). As I recollect, Fred and I discussed this at some point, but either I wasn't able to convince him or he didn't have time to implement this. If we just edit midg.c to have the following thresholds: #define THRESHOLD_HIGH0 1000.5 #define THRESHOLD_HIGH1 1000.25 #define THRESHOLD_HIGH2 1000.0 will this achieve the effect I've suggested? Or will this also set the threshold high for initial migration decisions too? 1705. Date: Mon, 15 Oct 1990 11:53:17 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: nfsmount I received the following message from Anthony up in Alberta. About the nfsmount stuff, yes you where right we had where having permission problems. 
I forgot that sprite's gethostbyname returns fully qualified host names, while at our site the policy is to return just the name segment, hence the nfs host did not recognize the sprite machine. We fixed that and everything works. You should add something in the docs about this and about the fact that the mount point should be a remote link not a dir. My first thoughts were that it is a dir and nfsmount turns it into a remote link. 1706. Date: Mon, 15 Oct 90 11:53:40 PDT From: rab (Robert A. Bruce) Subject: sendmail aliases file I thought about the sendmail alias file problem a little more. When I was at Maryland, we had two programs, `addme' and `deleteme'. These were programs that anyone could use to add or delete themselves from mailing lists. The syntax was addme <mailing-list> deleteme <mailing-list> or for someone in group wheel, addme <mailing-list> <user> <user> ... deleteme <mailing-list> <user> <user> ... So there was almost never a need to directly edit the aliases file, and there was no need to have it generally world writable. 1707. Date: Mon, 15 Oct 90 11:56:55 PDT From: Mike Kupfer <kupfer> Subject: junk in /tmp There are editor temp files in /tmp from August and other stuff from June. Do we have anything set up to remove old files from /tmp? The reason I ask is that I just got the following error message from the mailer daemon. I assume that the reason it couldn't create the temp file is that it already exists (dated 26 August). (The original message did seem to get logged, though, so I'm not sure how important the error message is.)
mike -- Return-Path: MAILER-DAEMON Received: by sprite.Berkeley.EDU (5.59/1.29) id AA593485; Mon, 15 Oct 90 03:11:35 PDT Date: Mon, 15 Oct 90 03:11:35 PDT From: MAILER-DAEMON (Mail Delivery Subsystem) Subject: Returned mail: unknown mailer error 1 Message-Id: <9010151011.AA593485@sprite.Berkeley.EDU> To: kupfer ----- Transcript of session follows ----- /users/sprite/cmds.gen/logger: /tmp/sh50e510: cannot create 554 "|/users/sprite/cmds.gen/logger sprite log 'Sprite Log'"... unknown mailer error 1 554 "|/users/sprite/cmds.gen/logger sprite log 'Sprite Log'"... unknown mailer error 1 ----- Unsent message follows ----- Received: by sprite.Berkeley.EDU (5.59/1.29) id AA729409; Fri, 12 Oct 90 15:07:04 PDT Message-Id: <9010122207.AA729409@sprite.Berkeley.EDU> To: bugs Subject: runaway shell on allspice Date: Fri, 12 Oct 90 15:07:03 PDT From: Mike Kupfer <kupfer> I just killed a csh that was sucking up lots of allspice's cycles. 10e0b READY 2:30 csh -i (it was, at various times, taking 35% of the CPU). It was owned by root; is there some way to track down who might have owned it? (Under UNIX I'd check the controlling tty and see who was logged in there. Is there some Sprite equivalent?) 1708. Date: Mon, 15 Oct 90 13:18:36 PDT From: shirriff (Ken Shirriff) Subject: Re: junk in /tmp >I use the program "rmold" (in /sprite/admin.%MACHINE) to clean up >/tmp from time to time. I just ran it now to delete things older >than 14 days. Unfortunately, it didn't get rid of all that much. This program doesn't seem to have worked. There are huge numbers of files from August and September in /tmp. 1709. Date: Mon, 15 Oct 90 14:34:15 PDT From: gibson@apathy.Berkeley.EDU (Garth Gibson) Subject: Re: migration load average In addition to John's comments, I wonder if pmake/mig'd will fairly and efficiently share a system with multiple background pmakes? Will each user/pmake get an arbitrary and varying subset of machines? Some form of scheduling maybe called for. 
It would also be nice if background jobs were "niced". 1710. Date: Mon, 15 Oct 90 16:56:25 PDT From: ouster (John Ousterhout) Subject: Replicated Ultrix stuff? I just noticed that we have both /ultrix/cmds.ds3100 and /sprite/ultrix/cmds.ds3100. Does anyone know why we need both of these directories? If they're redundant, can one be eliminated? 1711. Date: Tue, 16 Oct 90 10:43:04 +0100 From: Fred Douglis <douglis@cs.vu.nl> Subject: Re: migration load average if pmake worked properly, then each user would get within one host of each other. (actually, as it works now, each pmake would get within one host of each other, so one user running 2 pmakes gets twice as many machines. in retrospect this was a mistake.) since pmake is broken, and it forgets about machines after evictions, it's hard to say. as for running "niced", it might actually be possible to change pmake to do that. maybe if i have the time i'll try building a version of pmake that does that and tell you about it. should be trivial. (famous last words.) 1712. Date: Tue, 16 Oct 90 12:31:45 PDT From: shirriff (Ken Shirriff) Subject: .newsrc junked I don't know if this is file trashing, a rn bug, or related to the mail bug, but my .newsrc file got replaced this morning by an equal length of garbage. 1713. Date: Tue, 16 Oct 90 16:35:07 PDT From: Mike Kupfer <kupfer> Subject: uptime hang (whining) I tried doing an "uptime" on sage and allspice before the most recent crash. In both cases the "uptime" hung, apparently because pride was down. (I guess pride had been running the migration daemon or something.) kill -9 would not get rid of the process. Why is "uptime" so fragile, and why can't I use "kill -9" to blow processes out of the water? 1714. Date: Tue, 16 Oct 90 18:07:53 MDT From: stolcke@ICSI.Berkeley.EDU Subject: Re: Files in lost+found > > You have files in the following lost+found directories. These files were > recovered during reboot. 
Please examine the following directories > and recover or delete your files. > /X11/lost+found I keep getting these messages without being able to follow the advice because of the permissions on /X11/lost+found. Can anybody help? 1715. Date: Tue, 16 Oct 90 20:49:09 PDT From: ouster (John Ousterhout) Subject: lost+found directories I thought we were going to reprotect these so that they're world-writable? Bob, can you take care of this? Thanks. -John- 1716. Date: Wed, 17 Oct 90 08:46:16 PDT From: ouster (John Ousterhout) Subject: Re: migration load average I've bumped the thresholds up to 2.5-3, so Garth's simulations shouldn't prevent compilations from migrating. It would still be nice to get the low-priority stuff niced, though. 1717. Date: Wed, 17 Oct 90 09:34:14 +0100 From: Fred Douglis <douglis@cs.vu.nl> Subject: Re: uptime hang (whining) you can't kill processes that are in the middle of an RPC, and if allspice hangs, then so will the migration daemon. if pride were down and nothing else was wrong, the RPCs would time out rather than hang, and everything would work fine. (presumably.) 1718. Date: Wed, 17 Oct 90 16:25:20 PDT From: Mike Kupfer <kupfer> Subject: "info reg" can hang sun4c gdb I was trying to debug some (libc) code on sage that wasn't built -g. I did an "info reg", but gdb got stuck after "i7". Eventually sage panic'd with a "floating point exception with bad trap code" (this is with the 1.075 kernel). I think the floating point registers come after i7, and I think the problem is that the application (X) hadn't done any floating point operations, so the floating point state was uninitialized. 1719. Date: Wed, 17 Oct 90 16:40:17 PDT From: Mike Kupfer <kupfer> Subject: sun4c X died - malloc bug? I had just bugged an icon (to de-iconify it) and my X froze up. This is the X from cmds.new. Here's the stack backtrace: #0 0xabddc in malloc () #1 0x9602c in Xalloc (...) (...) #2 0x1b2e4 in miRegionCopy (...) (...) #3 0x1e03c in miSubtract (...) (...) 
#4 0x2463c in miComputeClips (...) (...) #5 0x245a0 in miComputeClips (...) (...) #6 0x245a0 in miComputeClips (...) (...) #7 0x245a0 in miComputeClips (...) (...) #8 0x245a0 in miComputeClips (...) (...) #9 0x245a0 in miComputeClips (...) (...) #10 0x245a0 in miComputeClips (...) (...) #11 0x24b5c in miValidateTree (...) (...) #12 0x8cd44 in MapWindow (...) (...) #13 0x6afb4 in ProcMapWindow (...) (...) #14 0x6a98c in Dispatch (...) (...) #15 0x7f010 in main (...) (...) Xalloc was trying to allocate with amount=104. There aren't any symbols for malloc, so it's hard to say much about why it died. malloc died at the following instruction: 0xabddc <malloc+244>: ld [l3],o5 "info reg" displayed l3 as l3 0x2d2d2d2d 757935405 1720. Date: Wed, 17 Oct 90 17:53:02 PDT From: Mike Kupfer <kupfer> Subject: where are the kdbx sources? /sprite/src/attcmds/kdbx didn't have a ds3100.md, which is peculiar, since the DECstations are the only machines we use kdbx on. Also, the strings in the installed kdbx don't match the strings in /sprite/src/attcmds/kdbx (e.g., "command file name must follow -c flag" instead of "missing command file name for -c"). Is /sprite/src/attcmds/kdbx in fact the right place for the sources? I can't find any other likely looking directory. By the way, /sprite/src/attcmds/kdbx doesn't build, either - it can't find <machine/psl.h>. 1721. Date: Wed, 17 Oct 90 21:41:17 PDT From: ouster (John Ousterhout) Subject: Re: where are the kdbx sources? Sorry, but there aren't any kdbx sources. Mike Nelson did kdbx at DEC and was never able to get clearance to return the sources to Berkeley. Thus we only have the binary. Think of it on the positive side: this means we don't have to worry about maintaining it. Seriously, the right long-term solution, I think, is to get gdb running on the DS3100's, but I'm hoping someone else outside Sprite will do this (if they haven't done it already). 1722. 
Date: Wed, 17 Oct 90 22:46:53 PDT From: Mike Kupfer <kupfer> Subject: Re: where are the kdbx sources? > Sorry, but there aren't any kdbx sources. Mike Nelson did kdbx at > DEC and was never able to get clearance to return the sources to > Berkeley. Does this mean that /sprite/src/attcmds/kdbx should be tossed, except for the man page? > Seriously, the right long-term solution, I think, is to get gdb > running on the DS3100's, The gdb sources at CMU have 3100 support, though it doesn't look like those files have made it into the FSF's gdb distribution. 1723. Date: Thu, 18 Oct 90 08:46:09 PDT From: ouster (John Ousterhout) Subject: Re: where are the kdbx sources? Although I'm not certain, I think that /sprite/src/attcmds/kdbx is the old Sun-3 version, which we don't use anymore. If this is true, then it should just be deleted, I think. 1724. Date: Thu, 18 Oct 90 10:24:33 PDT From: mendel (Mendel Rosenblum) Subject: tar/sprite fs incompatiblity For some reason tar opens files being created during extraction with the options: O_NDELAY|O_WRONLY|O_APPEND|O_CREAT|O_EXCL. The O_NDELAY causes sprite to set the stream as NON_BLOCKING. Unfortunately, non-blocking streams to regular files work differently in Sprite than Unix. In Unix, writes to non-blocking regular files behave the same as writes to blocking regular files. In Sprite, writes to non-blocking files return EWOULDBLOCK if the file cache is full. This error causes the file not to be written. I think the fix is to do the same thing we did for reads of files that block because of cache full. 1725. Date: Thu, 18 Oct 90 11:48:08 PDT From: mendel (Mendel Rosenblum) Subject: proc module deadlock while debugging I hit the following deadlock while trying to debug a user process with gdb. A process being debugged gets a signal and calls Proc_SuspendProcess() which locks the process and then grabs the debugLock to put the process on the debug list. 
The gdb process calls Proc_Debug with the PROC_GET_THIS_DEBUG option which grabs the debugLock and then tries to lock the process. The processes end up waiting for each other to release a lock. 1726. Date: Thu, 18 Oct 90 13:11:46 PDT From: ouster (John Ousterhout) Subject: Frame buffer verbosity When I start up X a bunch of messages appear on my screen about the frame buffer type, color map info, etc., sort of like somebody was debugging with printf's and didn't ever take out the printfs. Do these messages really need to be there? If not, can the relevant person (Mary, perhaps?) chop them out? 1727. Date: Thu, 18 Oct 90 13:32:32 PDT From: mgbaker (Mary Gray Baker) Subject: Re: Frame buffer verbosity That's because I accidentally left some of those messages in the last kernel. I removed them for the next kernel already. 1728. Date: Thu, 18 Oct 90 15:29:55 PDT From: kupfer (Mike Kupfer) Subject: can't "ps -a | more" on allspice I keep getting "Signal 22", followed by a csh job number and two pids (e.g., "[1] 80e53 80e54"). 1729. Date: Fri, 19 Oct 1990 16:05:09 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: Rpc_Dispatch didn't check for short packets There was a bug in Rpc_Dispatch that caused it to byte-swap incoming RPC headers before verifying that the packet length was greater than the size of a header. This caused all sun3s with an Intel chip and the new kernel to crash, along with all sun4s. I'm not sure why the other machines didn't crash. Also, the packet in question came from the gateway "csgw". It was an IP packet to the broadcast address. It didn't occur to me until later that this was peculiar, so I don't know what the packet contained. From looking at the code the packet must have contained the correct IP protocol (NET_IP_PROTOCOL_SPRITE). Perhaps there is a conflict? I fixed Rpc_Dispatch so it checks that the packet is larger than an RPC header first. Log-Number: 30231 From: jhh@sprite.Berkeley.EDU (John H. 
Hartman) Date: Fri, 19 Oct 1990 17:21:47 PDT Subject: stat() fails if server is down The fstat() system call will fail with an invalid argument if the server is down. This means I have to modify netroute.new to ignore the error. Wouldn't it make sense if all RPCs to a server hung if the server was down? John 1730. Date: Fri, 19 Oct 90 17:40:22 PDT From: ouster (John Ousterhout) Subject: Re: stat() fails if server is down Initially some calls failed and some hung if the server was down. As time went by, Brent gradually changed more and more of the calls to hang. I think the correct behavior is probably for *all* of the calls to hang. If it's easy to change fstat to do that, I think it would be the best thing to do. 1731. Date: Fri, 19 Oct 90 22:51:23 PDT From: Mike Kupfer <kupfer> Subject: X byte swapping death Mario Silva is getting bitten by the following bug when he runs a CAD program ("vem", I think) on a ds3100, using a sun4c for the X server. In the instance that I debugged, SProcCreateWindow (in dix/swapreq.c) wants to byte-swap a "create window" request, which apparently consists of a header followed by some number of bytes. The "some number of bytes" part is handled by passing a pointer and an unsigned count to a byte-swapper. The count is computed by subtracting the size of the request header from the request length. The problem is that the request length, as given to SProcCreateWindow, is less than the size of the request header, leading to a very large unsigned count. The byte-swapper (SwapLongs) eventually gets a segmentation fault. Now, I don't understand the details of X protocol processing. Does the client specify how long a particular request is, or does the server compute the request length from the request type? (It looks to me to be the former, but I'm not sure.) Could someone who understands this stuff better than I do tell me what's supposed to happen?
(Another way to phrase the question is "is ReadRequestFromClient broken, or have we found a way for a faulty client to crash X?") 1732. Date: Sat, 20 Oct 90 01:53:27 PDT From: gibson (Garth Gibson) Subject: migration I'm not sure what is going on. I've started a background simulation pmake and did not get very many machines. So I thought I must be competing with other background jobs or late night users, but if I rsh into the machines that are not available, all I find is the ipServer getting 2.7% of the cpu. Are the other background jobs being evicted by my login? Or is there something wrong with the migd? 1733. Date: Sat, 20 Oct 90 01:56:07 PDT From: gibson (Garth Gibson) Subject: another problem tonight when i try to tear down my simulations, kill -KILL is hanging?? garth 1734. Date: Sat, 20 Oct 90 11:46:08 PDT From: gibson (Garth Gibson) Subject: Re: migration load average I tried to do some simulation runs last night. I ran on vagrancy (ds3100) and had a script snapshotting "ps" every 2 minutes. I used the uninstalled pmake. I did notice that the simulations were running at PRI <. I also noticed that pmake issued 1 job to each machine idle when it started then NEVER ISSUED NEW JOBS to any machine (neither machines released by their user, remote machines finishing a simulation, nor even the local machine finishing a simulation). I ran pmake verbose. Here is the output: ___________________________________________________________________ /scratch3/gibson/pmake: Lockfile owned by you -- ignoring it +++ Initializing job 'Run/run.2.6/reli.db' ... Host 2 reclaimed. JobFlagForMigration(2) called. JobFlagForMigration(2) found job 'Run/run.2.9/reli.db'. ---------------------------------------------------------------------- 1735. Date: Sat, 20 Oct 90 13:28:55 PDT From: gibson (Garth Gibson) Subject: pmake problems continued I tried pmake on the sun4s this morning and it works much better than pmake on the ds3100s.
It has the lower priority, so it is Fred's latest version, but it appears to be issuing on simulation completions (and still gets multiple simulations on the host). 1736. Date: Sat, 20 Oct 90 15:33:20 PDT From: mgbaker (Mary Gray Baker) Subject: kvetching with lots of cc messages Kvetching has many messages on its console about open of /sprite/cmds/cc waiting for recovery Remote exec of /sprite/cmds/cc failed: the system call was aborted by a signal Could this be related to the migration problems that Mark and Garth have been having? 1737. Date: Sat, 20 Oct 90 16:29:26 PDT From: mgbaker (Mary Gray Baker) Subject: kvetching reboot fixed hanging pmakes Rebooting kvetching fixed Mark's problem with hanging pmakes. I guess he didn't report it to the bugs alias. 1738. Date: Sun, 21 Oct 90 12:11:39 PDT From: sullivan (Mark Sullivan) Subject: bugs I'm not sure where to send this bug report. ftp coredumped when I was copying stuff back to shangri-la. I managed to send about a dozen files successfully, then it seg faulted. The dead process is still in the debug state on arson if you want to look at it. 1739. Date: Sun, 21 Oct 90 15:12:18 PDT From: gibson (Garth Gibson) Subject: ds3100 pmake problems Pmake on the ds3100s is now back to normal - issuing jobs more than once. Perhaps kvetching's RPC problem was hanging pmake for me as well as Mark. 1740. Date: Mon, 22 Oct 90 10:53:20 +0100 From: Fred Douglis <douglis@cs.vu.nl> Subject: migration hanging a comment on kvetching screwing up migration: we've seen this before. if it should happen again, it would certainly be worthwhile for someone to debug the offending machine before rebooting it, to see what's getting locked up. (or maybe there's even an explanation of this in the sprite log, if i or someone else debugged it before. i don't recall.) 1741.
Date: Mon, 22 Oct 90 09:30:09 PDT From: mendel (Mendel Rosenblum) Subject: fsattach on ds3100: Unknown option -c The fsattach called on the ds3100 during booting produces the message: Unknown option "-c"; type "fsattach -help" for information It appears that the fsattach in /boot/cmds.ds3100 is out of date. Is it safe (ie will assault reboot) to type "make install" in fsattach? I noticed that John H. has many non-checked in changes in fsattach. 1742. Date: Mon, 22 Oct 90 15:37:51 PDT From: ouster (John Ousterhout) Subject: Ethernet resets I rebooted piracy with "verynew" this morning. Since then it's gotten about 8 Ethernet chip resets. With the old kernel I doubt that I would have seen this many resets in a week, so I'm pretty sure something's working differently. What's interesting is that Piracy is a DS3100, not a Sun; did the net module even change for DS3100's? 1743. Date: Mon, 22 Oct 90 18:52:03 PDT From: shirriff (Ken Shirriff) Subject: Allspice servers died All the servers on allspice except bootp were gone (ipServer, inetd, sendmail, tftpd, etc.), so I restarted them. 1744. Date: Tue, 23 Oct 90 11:02:26 PDT From: mendel (Mendel Rosenblum) Subject: stream recovery bug This bug report describes a bug in recovering server shadow streams after the client and server lose communication. This problem only happens when the server thinks a client has crashed (eg an RPC timeout). The bug causes client's reads and writes to open files to fail after recovery with a invalid argument error message. The problem can be deadly to your machines if the files are swap files. When a client machine has an open file, it has a Fs_Stream handle that contains the current offset into the file and a pointer to the "Fsrmt_FileIOHandle" of the file. The server also keeps parallel data structures. The Fs_Stream on the server is called the "shadow stream". 
When the server loses communication with the client, it goes thru the entire handle table marking the FileIOHandles to note that the client is no longer using them. It doesn't do anything to the shadow streams. If the client was the only user of the I/O handle, the handle now becomes available for reuse. Once the server frees the I/O handle, the shadow stream points to garbage. If the client restarts communication with the server and goes thru recovery, it fetches and uses the shadow stream that points to garbage. Fortunately, the way our memory allocator works and the memory allocation pattern mean the I/O handle most likely will be replaced by another I/O handle. There is a check in the stream reopen procedure that detects if the shadow stream is pointing at a different I/O handle than the client thinks. A message of the form: Fsio_StreamReopen, I/O handle mismatch, client 48 its I/O <10,5885666> my I/O <10,184043> is printed on the server and it fails to recover this stream. Any future I/O will also fail. Note that if the client reboots then the shadow streams are never recovered and become yet another memory leak. The "right" thing to do is reimplement the file handles not to use broken garbage collection as its primary storage management technique. An easier fix would be to implement the ClientKill procedure (the routine that gets called when a client dies) to toss the shadow streams only in use by the "crashed" client. An even easier fix would be to toss incorrect shadow streams found during recovery. 1745. Date: Tue, 23 Oct 90 11:16:03 PDT From: mendel (Mendel Rosenblum) Subject: Mx/tx showBindings command broken The mx and tx showBindings command procedure doesn't work. It produces a header but no binding list. It looks like: Keystroke Bindings: --------- -------- 1746.
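Re 1744 (stream recovery bug): a minimal sketch of the "even easier fix", assuming a shadow stream records the <major,minor> ID of the I/O handle it was attached to. The types and names below (FileID, ShadowStream, TossIfStale) are hypothetical stand-ins, not the real Sprite fs structures. The idea is only that recovery compares the ID the client reports with the one the server's shadow stream currently points at, and discards the shadow stream on a mismatch, presumably so that a fresh one can be set up instead of failing the reopen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the <major,minor> pair printed in the log message. */
typedef struct {
    int major;
    int minor;
} FileID;

/* Hypothetical server-side shadow stream; the real Fs_Stream holds more. */
typedef struct {
    FileID ioFileID;    /* ID of the I/O handle this shadow stream points at */
    int    inUse;       /* nonzero while the shadow stream is considered valid */
} ShadowStream;

static int
FileIDEqual(FileID a, FileID b)
{
    return a.major == b.major && a.minor == b.minor;
}

/*
 * Called during recovery with the I/O handle ID the client believes it has.
 * Returns 1 if the shadow stream was stale and has been tossed,
 * 0 if it still matches and can be reused as is.
 */
int
TossIfStale(ShadowStream *shadowPtr, FileID clientIoID)
{
    assert(shadowPtr != NULL);
    if (shadowPtr->inUse && FileIDEqual(shadowPtr->ioFileID, clientIoID)) {
        return 0;
    }
    shadowPtr->inUse = 0;   /* toss it; the caller would rebuild a new one */
    return 1;
}
```

With the IDs from the message above (client <10,5885666>, server <10,184043>) TossIfStale reports the stream as stale rather than letting the reopen fail.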
Date: Tue, 23 Oct 90 15:30:32 PDT From: Mike Kupfer <kupfer> Subject: Re: can't "ps -a | more" on allspice The problem seems to be related to piping things into "more", and it seems to depend on running on the console. For example, "cat .cshrc | more" failed frequently for me (though not always, and not just as root), but only on the console. "more .cshrc" works fine. 1747. Date: Tue, 23 Oct 90 15:43:10 PDT From: Mike Kupfer <kupfer> Subject: locking confusion in ranlib ranlib is supposed to lock the archive file before operating on it. In this case it passes a flag to ar so that ar doesn't try to lock the archive (which would presumably deadlock). There is also a "no lock" flag that tells ranlib not to lock the archive. The strangeness is that the sense of the "no lock" flag gets reversed when ranlib invokes ar. That is, if ranlib doesn't lock the archive, it tells ar not to lock it, either. If ranlib does lock the archive (the default), it tells ar to lock it. (Yet we know that ranlib doesn't deadlock.) Anyone know what's going on here? 1748. Date: Tue, 23 Oct 90 17:08:12 PDT From: Mike Kupfer <kupfer> Subject: flock() compatibility bug [This is a resubmission from last Friday. Apparently allspice crashed just as the message was being logged.] flock() has 3 operations: LOCK_EX (exclusive lock), LOCK_SH (shared lock), and LOCK_UN (unlock). Sprite acts as though there are two operations - LOCK_EX and LOCK_SH - and LOCK_UN is treated as a modifier bit. The upshot is that under UNIX you simply say flock(fd, LOCK_UN); but under Sprite you have to say flock(fd, LOCK_UN|LOCK_EX); This is painful to fix in the libc stub routine, because you have to know whether the current lock is exclusive or shared (see the Fsio_Unlock kernel routine). Would it be reasonable to change Fsio_Unlock so that if "operation" is unspecified, Fsio_Unlock will pick the correct "operation" based on the current lock flags? 1749. 
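Re 1748 (flock() compatibility): a minimal sketch of the proposed behavior, where a bare LOCK_UN has the lock type filled in from the current lock state. LockState and NormalizeUnlockOp are hypothetical names for illustration, not the real Fsio_Unlock interface; only the LOCK_* constants come from <sys/file.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/file.h>   /* LOCK_SH, LOCK_EX, LOCK_UN */

/* Hypothetical per-file lock state; not the real Sprite structure. */
typedef struct {
    int exclusive;      /* nonzero if an exclusive lock is held */
    int numShared;      /* number of shared locks currently held */
} LockState;

/*
 * Return the operation to hand to the unlock code.  If the caller passed
 * LOCK_UN without LOCK_SH or LOCK_EX (the UNIX convention), pick the lock
 * type from the current state.  Anything else is passed through unchanged.
 */
int
NormalizeUnlockOp(const LockState *statePtr, int operation)
{
    assert(statePtr != NULL);
    if (!(operation & LOCK_UN)) {
        return operation;               /* not an unlock request */
    }
    if (operation & (LOCK_SH | LOCK_EX)) {
        return operation;               /* caller already said which kind */
    }
    if (statePtr->exclusive) {
        return operation | LOCK_EX;
    }
    if (statePtr->numShared > 0) {
        return operation | LOCK_SH;
    }
    return operation;                   /* nothing held; left to the caller */
}
```

Doing this in the kernel rather than in the libc stub avoids the stub having to track whether the current lock is shared or exclusive, which is the difficulty the message describes.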
Date: Wed, 24 Oct 1990 14:05:34 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: migration on sun3 w/ verynew kernel If I try to run pmake on a sun3 running the verynew kernel I get the following error: Warning: Proc_MigrateTrap: error encountered sending encapsulated state: the peer process of a migrated process does not exist. The first time I tried this was followed by a complaint that the serverID for an rpc was bad (250). The second time I got a fatal error because the serverID was zero (broadcast). 1750. Date: Wed, 24 Oct 1990 16:15:12 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: writes to network devices always succeed DevNet_FsWrite always returns SUCCESS, even if you write garbage. The fix isn't trivial, but I'll look into it. 1751. Date: Wed, 24 Oct 90 19:53:04 PDT From: rab (Robert A. Bruce) Subject: NIL ioHandlePtr Sabotage crashed in FindCode(), vmSeg.c. It was running MR.052. Fsio_DeencapStream() had created an Fs_Stream structure with a NIL ioHandlePtr. Should ioHandlePtr ever be NIL? The comment in the header file is obsolete: it says to look at fsInt.h but there is no fsInt.h. Fs_HandleHeader *ioHandlePtr; /* Stream specific data used for I/O. 
* This really references a somewhat * larger object, see fsInt.h */ Here is the stack trace: #1 0xf60ad970 in FindCode (filePtr=(struct Fs_Stream *) 0xf6525b30, procLinkPtr=(VmProcLink *) 0xf64f5b30, usedFilePtr=(ClientData) 0xf82c9c4c) (vmSeg.c line 249) #2 0xf60ad868 in Vm_FindCode (filePtr=(struct Fs_Stream *) 0xf6525b30, procPtr=(struct Proc_ControlBlock *) 0xf64ecc30, execInfoPtrPtr=(Vm_ExecInfo **) 0xf82c9c54, usedFilePtr=(ClientData) 0xf82c9c4c) (vmSeg.c line 198) #3 0xf60a8604 in Vm_DeencapState (procPtr=(struct Proc_ControlBlock *) 0xf64ecc30, buffer=(char *) 0xf64663c4 "") (vmMigrate.c line 265) #4 0xf607db48 in ProcMigReceiveProcess (procPtr=(struct Proc_ControlBlock *) 0xf64ecc30, inBufPtr=(Proc_MigBuffer *) 0xf82c9d48) (procMigrate.c line 947) #5 0xf6085b70 in Proc_RpcMigCommand (...) (...) #6 0xf60912f0 in Rpc_Server (...) (...) #7 0xf6097598 in Sched_StartKernProc (...) (...) #8 0xf6097518 in Sched_StartKernProc (...) (...) 1752. Date: Thu, 25 Oct 90 11:00:12 +0100 From: Fred Douglis <douglis@cs.vu.nl> Subject: Re: migration on sun3 w/ verynew kernel did you get a backtrace to see who had garbaged the RPC id? anyway, it sounds like someone probably made an incompatible change to migration and -- perhaps -- migration between two verynew kernels will work. if that's the case, incrementing the migration version number and remaking verynew should help. otherwise you may have to change the migration code to account for whatever else changed in the kernel. i'm willing to consult. 1753. Date: Thu, 25 Oct 90 06:00:41 PDT From: rab (Robert A. Bruce) Subject: invisible file The file /sprite/src/kernel/sync/LOCK.make shows up if I type ``echo *'' but if I type ``ls'' it says `LOCK.make not found'. If I try to make anything in the directory it says: ``pmake: Could not create lock file LOCK.make'' The Makefile and all the *.md/*.mk files in sync are zero length. 1754. 
Date: Thu, 25 Oct 90 09:44:27 PDT From: tve (Thorsten von Eicken) Subject: migration trashed local migration daemons can't open the /sprite/admin/migd/pdev file to connect to the master daemon. I guess one ought to restart the global daemon. I looked into the man pages for migd, mig and migcmd, but none tells me how to restart the global daemon (I seem to remember one has to send the right signal to the local daemon), Fred, could you fix this? I also couldn't figure out where the global daemon runs at the moment, again, a pointer in the man page would be helpful. 1755. Date: Thu, 25 Oct 90 17:45:31 +0100 From: Fred Douglis <douglis@cs.vu.nl> Subject: Re: migration trashed you could try just removing the pdev file, and then kill -KILL the server. get the processID from the file /sprite/admin/migd/check or something like that. sorry that this isn't documented, but i'm afraid someone else will have to add it at this point. editing remotely isn't much fun from over here... 1756. Date: Thu, 25 Oct 1990 10:05:42 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: sprite log problems Mail from Fred ends up looking like this in the index files: 30258 ,@tempest.cs.vu.nl,@localhost.cs.vu.nl:douglis@tempest.cs.vu.nl Thu Oct 25 03:00:45 1990 442 Re: migration on sun3 w/ verynew kernel 1757. Date: Thu, 25 Oct 90 10:28:11 PDT From: tve (Thorsten von Eicken) Subject: Re: migration trashed >you could try just removing the pdev file, and then kill -KILL the >server. get the processID from the file /sprite/admin/migd/check or >something like that. Well, I found the process ID, but how do I figure out on which machine that process lives? TvE 1758. Date: Thu, 25 Oct 90 10:44:55 PDT From: mendel (Mendel Rosenblum) Subject: where does fscheck output go Where does the output from allspice's fscheck go? 1759. Date: Thu, 25 Oct 90 16:38:53 PDT From: bmiller (Bob Miller) Subject: printer problem our printer, lw533, doesn't seem to want to print anything from SUBVERSION (at least). 
SHALLOT is printing OK. Can someone check into this? Thanks. 1760. Date: Thu, 25 Oct 1990 17:35:46 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: allspice crash Allspice's problems were due to a number of bugs. Here is the list. 1) Allspice was crashing due to a screw-up in the memory allocater. This was very reproducible. It happened right after Allspice started answering requests for the new disk. Since the new filesystem is much larger than any other there may be a problem handling large filesystems. 2) There is a bit in the summary sector that is set when fscheck checks the disk. This should be cleared when the disk is attached, but the new kernel is not doing it. This is a very serious problem that needs to be fixed immediately. 3) The output from fscheck is not being logged correctly. 4) The kernel attaches a few disks looking for the root disks. The aforementioned fscheck bit is then cleared, causing these disks to be checked on every reboot. In summary, problem 1 was solved by removing the disk from allspice. I plan on attaching it to murder for further tests. Problems 3 and 4 aren't life-threatening. Problem 2 is temporarily fixed by having allspice check its disk on every reboot. This probably doubles the reboot time. Mendel and I are working on getting a new version of the kernel out. 1761. Date: Thu, 25 Oct 90 19:06:35 PDT From: mgbaker (Mary Gray Baker) Subject: xrn buttons In the last week, xrn seems to have lost a lot of its buttons (such as "previous" to back up to a previous message) while in message-reading mode. The binary hasn't changed for a month, so I don't know why this is. Does anyone know? 1762. Date: Thu, 25 Oct 90 20:37:41 PDT From: rab (Robert A. Bruce) Subject: hoot I set up hoot, a sun3/75 in Evans 444. It downloads the kernel ok, but when it broadcasts for the root server, allspice doesn't answer. I double checked all the files that addhost modifies, and I restarted the servers on allspice, but that didn't help. 1763. 
Date: Fri, 26 Oct 90 09:12:35 PDT
From: mendel (Mendel Rosenblum)
Subject: Bug in Proc_WaitForMigration()

For some reason procMigration.c uses NULL rather than NIL for some of its "no-value" conditions. This leads to problems like the one that crashed jaywalk last night:

    ReturnStatus
    Proc_WaitForMigration(processID)
        Proc_PID processID;
    {
        Proc_ControlBlock *procPtr;
        ReturnStatus status;

        procPtr = Proc_LockPID(processID);
        if (procPtr == NULL) {
            return(PROC_INVALID_PID);
        }

Proc_LockPID() returns NIL on failure, not NULL. I fixed this check.

1764.
Date: Fri, 26 Oct 90 13:08:46 PDT
From: sullivan (Mark Sullivan)
Subject: ftp bug (again)

As requested, here is the second bug report. Ftp segfaults when I "mput" files to shangri-la. Reproduce by:

    ftp shangri-la
    cd cad
    mput context.c

The file is sent successfully, then ftp dies. I was able to ftp at about 1:30 last night with no problems. No idea why it suddenly stopped working again.

1765.
Date: Fri, 26 Oct 90 13:54:25 PDT
From: pmchen@ginger.Berkeley.EDU (Peter M. Chen)
Subject: garlic crashed

During the allspice downtime (1:55pm), garlic (ds3100) crashed with

    MachKernelExceptionHandler: Address error on load: addr: b PC: 800c4f34
    Entering debugger with a TLB load address error exception at PC 0x800c4f34

I'll leave it in the debugger for your perusal. I obviously was doing nothing, since sprite was down at the time.

1766.
Date: Fri, 26 Oct 90 15:28:19 PDT
From: elm (ethan miller)
Subject: foo

Several times today I've gotten strange bugs with mail and xrn. The mail bug occurs when I try to start up mail on a sparcstation, and it prints the following message:

    /usr/tmp/Rx343604: file already exists

The file doesn't already exist. The same sort of problem occurs in xrn; the program is unable to create a file in /tmp.

1767.
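[Editorial sketch for the Proc_WaitForMigration() report above.] The failure mode is a lookup that signals "not found" with a sentinel other than NULL, combined with a caller that tests for NULL. The code below is only an illustration: DEMO_NIL, DemoProc, demo_lock_pid() and the two check functions are hypothetical names, and the all-ones sentinel value is an assumption, not a quotation of Sprite's definition of NIL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "no value" sentinel that is deliberately not NULL. */
#define DEMO_NIL ((void *) (uintptr_t) -1)

enum { DEMO_OK = 0, DEMO_INVALID_PID = 1 };

typedef struct {
    int pid;
} DemoProc;

static DemoProc demo_table[] = { { 10 }, { 11 } };

/* Look up a process by pid; returns DEMO_NIL (not NULL) on failure. */
DemoProc *demo_lock_pid(int pid)
{
    size_t i;

    for (i = 0; i < sizeof demo_table / sizeof demo_table[0]; i++) {
        if (demo_table[i].pid == pid) {
            return &demo_table[i];
        }
    }
    return DEMO_NIL;
}

/* Buggy check: tests for NULL, so the DEMO_NIL failure value slips through. */
int demo_check_with_null(int pid)
{
    DemoProc *procPtr = demo_lock_pid(pid);

    return (procPtr == NULL) ? DEMO_INVALID_PID : DEMO_OK;
}

/* Corrected check: tests for the sentinel the lookup actually returns. */
int demo_check_with_nil(int pid)
{
    DemoProc *procPtr = demo_lock_pid(pid);

    return (procPtr == DEMO_NIL) ? DEMO_INVALID_PID : DEMO_OK;
}
```

With an unknown pid, demo_check_with_null() reports success and a real caller would go on to dereference the sentinel; demo_check_with_nil() rejects the pid as intended.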
Date: Fri, 26 Oct 90 15:37:28 PDT
From: shirriff (Ken Shirriff)
Subject: /tmp problem cause

Probably Mendel knows this already, but the problem is in ofsFileDesc.c:

    if (fileDescPtr->flags & FSDM_FD_ALLOC) {
        printf( "Ofs_FileDescInit fetched non-free file desc\n");
        return(FS_FILE_EXISTS);
    }

1768.
Date: Fri, 26 Oct 1990 15:37:29 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: /tmp

The /tmp problem appears to be caused by too many files. I think I've seen this before. Rebooting allspice probably isn't necessary.

1769.
Date: Fri, 26 Oct 1990 16:02:12 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: vm/fs problems during exit

We need to redesign the fs and vm parts of the cleanup when a process exits. Currently the fs is cleaned up first, due to a race involving pseudo-devices. This doesn't always work, since the deletion of a COW segment may involve opening the swap file. It looks to me like the fs cleanup has to be in at least two parts. Be warned that the verynew kernel could crash due to this.

1770.
Date: Fri, 26 Oct 90 16:13:19 PDT
From: shirriff (Ken Shirriff)
Subject: ds3100 assertion crash

Violence, running my version of verynew, crashed during pmake with a failed assertion in Fs_GetSegPtr.

    assert(((unsigned int) fileHandle & WORD_ALIGN_MASK) == 0);

It failed to go into the debugger, so I couldn't find the problem. (Is this a bug with assert? It said "Fatal Error" and then kept going.)

1771.
Date: Sat, 27 Oct 90 15:12:30 PDT
From: shirriff (Ken Shirriff)
Subject: ftp bug

I tried to track down the ftp seg. fault bug caused by doing mput. The problem is caused by trying to free a table of pointers that apparently weren't malloc'd. I installed the newest version of tftp, but that didn't help. So I just commented out the offending free to prevent the problem. This may cause a memory leak in ftp, but who cares.

1772.
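[Editorial sketch for the ftp report above.] Commenting out the free avoids the crash at the cost of a leak. One alternative, shown here only as an assumption about how such a table could be handled and not as the actual ftp source, is to record for each entry whether it came from malloc() and release only those. NameEntry and the three functions are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* One table slot, plus a flag saying whether the slot owns heap memory. */
typedef struct {
    char *name;
    int heap_owned;    /* nonzero only if name came from malloc() */
} NameEntry;

/* Store a pointer the table does not own (static or caller-managed). */
void entry_set_borrowed(NameEntry *entry, char *name)
{
    entry->name = name;
    entry->heap_owned = 0;
}

/* Store a private heap copy; returns 0 on success, -1 if malloc fails. */
int entry_set_copy(NameEntry *entry, const char *name)
{
    size_t len = strlen(name) + 1;
    char *copy = malloc(len);

    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, name, len);
    entry->name = copy;
    entry->heap_owned = 1;
    return 0;
}

/* Free only the entries that own their memory; returns how many were freed. */
int table_release(NameEntry *table, size_t count)
{
    size_t i;
    int freed = 0;

    for (i = 0; i < count; i++) {
        if (table[i].heap_owned) {
            free(table[i].name);
            freed++;
        }
        table[i].name = NULL;
        table[i].heap_owned = 0;
    }
    return freed;
}
```

The flag costs one int per entry and lets the release path stay in place without ever passing a non-heap pointer to free().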
Date: Sun, 28 Oct 90 15:37:11 PST
From: tve (Thorsten von Eicken)
Subject: rdate broken again

For example from crackle, being root, "rsh allspice echo hello" doesn't work 'cause of permissions. Hence rdate allspice doesn't work either.

1773.
Date: Sun, 28 Oct 90 18:14:48 PST
From: mendel (Mendel Rosenblum)
Subject: ds3100 max scsi transfer size 8k?

The ds3100 scsi HBA sets the maximum transfer size to be 8K. Does anyone know why? This means that LFS is forced to do I/Os in 8K chunks, causing bad write performance. LFS on a ds3100 has a write rate of ~300 kilobytes per second. The sun4 can do over 3 times this.

1774.
Date: Sun, 28 Oct 90 23:17:43 PST
From: rab (Robert A. Bruce)
Subject: tapedrive on allspice doesn't work

I don't think the tapedrive works with the verynew kernel on allspice. Every time I try to use it I get

    Can't open `/hosts/allspice/dev/exabyte.norewind': no such device

Murder is also running the verynew kernel and murder's tapedrive works fine.

1775.
Date: Fri, 26 Oct 90 15:56:21 PDT
From: mendel (Mendel Rosenblum)
Subject: Allspices reboot and problems

Allspice got beaten unconscious by piracy swapping today. After piracy was killed it deadlocked on some locked file handle. I sync'ed the disk and fastbooted allspice. This appears to have left the /tmp disk with a file descriptor allocated on disk but not in the descriptor alloc bitmap.

1776.
Date: Mon, 29 Oct 90 12:18:15 PST
From: elm (ethan miller)
Subject: failed writeback message

On my sparcStation (terrorism), I get this message repeatedly:

    RmtFile "/sprite/admin/migd/terrorism.Berkeley.EDU.log" <10,9559> Write-back failed: out of disk space<40008>

There is plenty of disk space in /sprite/admin.

1777.
Date: Mon, 29 Oct 90 16:40:19 PST
From: Mike Kupfer <kupfer>
Subject: blackmail doesn't like my password?

I can't log in on blackmail's as myself. I can rlogin in, but only if rlogin doesn't ask for my password.

1778.
Date: Mon, 29 Oct 90 16:30:18 PST
From: kupfer (Mike Kupfer)
Subject: lack of line wrap on Blackmail's console

It looks like Blackmail's console driver doesn't do line wrapping, so if your $TERM is "dumb", long lines get truncated.

1779.
Date: Mon, 29 Oct 90 16:54:54 PST
From: mendel (Mendel Rosenblum)
Subject: varargs.h on ds3100 can't be included multiple times

Varargs.h on the ds3100 is different from the rest of the machine types in that it can't be included multiple times during the same compile.

1780.
Date: Mon, 29 Oct 90 16:55:57 PST
From: pmchen (Peter M. Chen)
Subject: time on syslog messages is wrong

Could it be the daylight savings time switch?

    10/29/90 17:43:09 anise (49) rebooted
    10/29/90 17:47:59 anise (49) rebooted
    10/29/90 17:53:47 raid1 (77) rebooted

1781.
Date: Mon, 29 Oct 90 17:02:14 PST
From: Mike Kupfer <kupfer>
Subject: "kill" doesn't know about HUP

The "kill" program doesn't understand that "HUP" means signal 1.

1782.
Date: Mon, 29 Oct 90 17:11:21 PST
From: Mike Kupfer <kupfer>
Subject: more on rdate failures

I found the following syslog entry for allspice:

    <28>Oct 27 04:01:18 inetd[30e34]: time/tcp accept: invalid argument

I sent inetd a SIGHUP and that seems to have made rdate work again.

1783.
Date: Mon, 29 Oct 90 17:45:26 PST
From: rab (Robert A. Bruce)
Subject: setuid programs in /sprite/cmds.symm

All the setuid bits have been cleared in /sprite/cmds.symm. Does anyone know what might have caused it? This happened a while ago in /sprite/ds3100.md, and it turned out that `strip' was the culprit.

1784.
Date: Tue, 30 Oct 1990 15:52:47 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: race sending ACKs

I believe this is a known bug. A client acknowledges "close" requests at interrupt level. The channel structure has a special buffer for these acks, but the channel is not marked as busy while they are being sent. This means it is possible for the channel to be reused and the ack buffer to be overwritten before the packet is sent.
This causes the warnings about the wrong server IDs in rpc packets from allspice. I'm not sure if this is serious, other than the many network resets it causes.

1785.
Date: Tue, 30 Oct 90 16:11:49 PST
From: Mike Kupfer <kupfer>
Subject: world-writable directories in /hosts?

Is there some reason why almost all the directories in /hosts are world-writable? That seems like a rather large security hole.

1786.
Date: Tue, 30 Oct 90 16:17:48 PST
From: Mike Kupfer <kupfer>
Subject: Re: world-writable directories in /hosts?

Perhaps bootcmds and other startup scripts should be moved to a protected directory.

1787.
Date: Tue, 30 Oct 90 16:45:11 PST
From: mendel (Mendel Rosenblum)
Subject: /dev/fb crashes machines

Sage crashed today while the X server was trying to do an IOControl on /dev/fb. The crash was because the devicePtr->data was NIL. I think the crash happened because of a bug in the DevFBClose() routine. The code needs to check to see if the stream is still open before freeing the memory it allocated. I fixed this bug. There are also several other problems with DevFBOpen() leaking memory. It leaks memory for opens that fail and for multiple opens of the same device. I didn't fix these.

1788.
Date: Wed, 31 Oct 90 13:18:15 PST
From: shirriff (Ken Shirriff)
Subject: chdir to file

If you do chdir() to a file, Sprite returns ENOENT instead of ENOTDIR; i.e. "doesn't exist" as opposed to "is not a directory". This should get changed as part of Unix system call compatibility.

1789.
Date: Wed, 31 Oct 90 13:22:08 PST
From: elm (ethan miller)
Subject: problems with auto-migration in tcsh

Last night I had lots of problems with the auto-migrator in tcsh. I don't know if they would apply to normal migration as well. The problem was that migrations to burble (from terrorism) seemed to die with this syslog message:

    <mig command> <date> burble (56) RPC timed-out

This would cause the shell to tell me that the command couldn't be executed.
Why are commands migrating to machines which often time out RPCs?

1790.
Date: Wed, 31 Oct 90 15:09:31 PST
From: Mike Kupfer <kupfer>
Subject: i386 as bug(s)

There were a couple fixes I made to the i386 gas while I was at Olivetti. It looks like our version of gas doesn't have them; I'm trying to get a copy of the fixes back from CMU. In the meanwhile, beware of the following nasty: if the assembler finds an error (e.g., unrecognized opcode), it will still generate a .o file, and it will exit with status 0. Thus unless you scan the make log, you won't know that anything went wrong.

1791.
Date: Thu, 01 Nov 90 10:10:34 +0100
From: Fred Douglis <douglis@cs.vu.nl>
Subject: Re: problems with auto-migration in tcsh

it takes a little while for the migration daemon to decide that a host is down -- it marks the host as down if either the pdev connection to the daemon on that host gets closed, or it doesn't receive an update message after a while. currently i think the grace period is on the order of a few minutes, since that's when the central daemon goes through its tables and makes a checkpoint. marking hosts as down can and should be done a bit more frequently. the easiest way to do this would be to change the checkpoint interval, though it would be possible to split the checkpointing & crash detection to be done at different rates.

1792.
Date: Thu, 1 Nov 1990 11:03:50 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: bug in fsdm

There was a bug in fsdm that causes a machine to crash if you try to attach a disk that is already attached. I've fixed this in the uninstalled fsdm.

1793.
Date: Thu, 1 Nov 90 11:19:16 PST
From: shirriff (Ken Shirriff)
Subject: X0msgs out of control

We ran out of disk space on / because /usr/adm/X0msgs was 56 megabytes. Shouldn't something be controlling the size of this file?

1794.
Date: Thu, 1 Nov 90 12:50:01 PST
From: tve (Thorsten von Eicken)
Subject: another silly limit in exec

Apparently in the exec* system calls, the environment copied to the exec'ed process gets cut down such that none of the strings in the environment are longer than about 1000 characters. Neither SunOS, nor SysV nor RISC/OS have such a low limit. Could someone make Sprite a little bit more generous? Why is there a limit anyway? Why can't I pass a 1Meg environment if I want to do so?

    TvE

NB: of course none of these limits are documented anywhere (at least not in any man page I could think of), which makes bug-finding rather painful.

1795.
Date: Fri, 02 Nov 90 13:13:23 +0100
From: Fred Douglis <douglis@cs.vu.nl>
Subject: sprite distribution

problems with the sprite distribution:

- fsmake and installboot were dynamically-linked binaries, and fsmake wouldn't run here when it couldn't find "strstr". bob recompiled fsmake with -Bstatic and it ran okay; I tried logging into ginger and compiling installboot with -Bstatic but hit "staticld: command not found" and gave up. instead, we tried running the one from the tape and it complained about a library version mismatch but ran okay otherwise.

- then when we booted sprite, it died quickly with "Fsdm_AttachDisk: setting rpc_SpriteID to 0x0 from disk header" followed by a bad rpc address. the kernel was "sprite version 1.0 rab" dated september 19. i had run fsmake with the options specified in the instructions. i am going to try running it again with an explicit "-host 1" option to see if that fixes the problem. certainly, it shouldn't be necessary.

1796.
Date: Fri, 2 Nov 90 15:11:52 +0100
From: douglis@sprite.cs.vu.nl
Subject: distribution bugs

some more bugs:

first of all, as you may gather from my last mail, i was able to get sprite up by explicitly setting the hostid in fsmake. this shouldn't be necessary or should at least be documented.

also, i had to run fsmake as root in order to write the disk.
again, this should be mentioned (i don't believe it is, anyway).

the comment 4A about sprite not rebooting automatically doesn't apply to sun3's -- it looks like this, and 4C, were taken from the ds3100 dist.

adduser has problems. the distribution doesn't mention adduser, though it exists -- it just points to "howto/addNewUser" which says to do it by hand. but doing it using adduser is only set up for the berkeley environment (/user1, /user2, /mic, etc.) and doesn't let me say just "/users/douglis" -- or at least it implies a link from /users/douglis to itself, which suggested the program would break if it tried building the directory.

/usr/tmp isn't built as part of the distribution. at least, while editing this message, it came out as /usr/tmp instead of /usr/tmp/ReNNNNN and /usr/tmp is now a file rather than a directory.

finally, there's no mention about configuring the system for timezones. you might have been amused that my last mail from sprite came out as 6am PST. i think this is fixed now, but only because i knew where to look. instructions for changing the time zone should be added.

1797.
Date: Fri, 2 Nov 90 09:53:32 PST
From: ouster (John Ousterhout)
Subject: Re: X0msgs out of control

I don't see why there should be a /usr/adm/X0msgs at all. Aren't X logs supposed to go in per-machine files rather than a single global file? Could there be a buggy X server around, or one with debugging output enabled to create such a huge file?

1798.
Date: Fri, 2 Nov 90 09:56:37 PST
From: mendel (Mendel Rosenblum)
Subject: Re: X0msgs out of control

This is an old bug in the X server that was fixed a long time ago. As soon as everyone is using Mary's new and improved X server it will go away. Is there something stopping the new X server from being installed?

1799.
Date: Fri, 2 Nov 90 11:31:22 PST
From: mgbaker (Mary Gray Baker)
Subject: infinite recovery on oregano

Oregano has been going through infinite recovery with allspice, getting a stale handle on /.
I tried to debug this, but there's no debuggable kernel for the installed sun3.md/verynew (MR.054). Mendel will make a debuggable kernel and then I will debug this. In the meantime, it may explain some part of allspice's poor performance.

1800.
Date: Fri, 02 Nov 90 16:11:56 PST
From: Mike Kupfer <kupfer>
Subject: /sprite/lib/symm.md lives on /scratch3

Was there a problem with insufficient space on / or something?

1801.
Date: Sun, 4 Nov 90 10:52:10 PST
From: ouster (John Ousterhout)
Subject: DS3100 load averages

For some reason, the load averages seem to be creeping up artificially on our DS3100's. Here's a partial rup listing:

    gluttony    ds3100  up  5+23:15  inuse  4.27 4.17 4.05  (1+16:43)
    heresy      ds3100  up  2+00:46  inuse  2.60 2.59 2.44  (1+17:37)
    hijack      ds3100  up  3+17:36  inuse  2.48 2.32 2.19  (0+22:50)
    kvetching   ds3100  up  2+23:22  inuse  2.51 2.38 2.32  (0+17:25)
    lsisim      ds3100  up  4+21:23  avail  1.00 1.41 1.59  (4+21:19)
    mustard     ds3100  up 17+21:12  inuse  1.51 1.87 2.09  (3+10:14)
    parsley     ds3100  up 23+02:58  inuse  2.07 2.27 2.17  (0+19:24)
    subversion  ds3100  up  6+02:40  inuse  2.35 2.02 2.02  (1+17:47)
    violence    ds3100  up  3+17:43  inuse  2.48 2.43 2.36  (1+11:15)

I rlogin-ed to a couple of these machines and poked around a bit; the machines appeared to be quite idle.

1802.
Date: Sun, 4 Nov 90 15:55:27 PST
From: ouster (John Ousterhout)
Subject: Allspice reboot

I rebooted Allspice this afternoon after it hung up a few of my RPCs; Mendel suspected that it was the usual "something-left-locked-by-a-server-process" bug, so we didn't debug. This problem is starting to happen a lot; apparently Allspice was rebooted several times yesterday with the same problem (did I just miss the bug reports for these reboots, or is someone still in the process of typing them in?). John, can you give the tracing stuff high priority so we can get this fixed before the whole world falls apart? Thanks.

P.S. When I came in this morning, Allspice was responding very very slowly.
I reset its network interface and it suddenly perked up again.

1803.
Date: Sun, 4 Nov 1990 23:21:25 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: problems with Fsutil_HandleRelease

Allspice crashed this evening due to problems with Fsutil_HandleRelease. It turns out that if your current working directory is deleted and you cd to "..", Fsutil_HandleRelease will be called by FslclLookup with the "locked" flag set to TRUE, when in fact the handle is unlocked. This caused a panic, which I continued (I don't think anything bad will happen).

Also, after looking through the code it appears to me that the problems we've been having with Proc_ServerProcs leaving handles locked are due to Fsio_FileCloseInt calling Fsutil_HandleRelease with "locked" set to FALSE as part of processing pending deletes, when in fact the handle is locked. This is only a theory at this point since I haven't tested it in a real kernel.

I'm sending Brent a copy of this message because I'm confused about the purpose of the "locked" flag. Why can't Fsutil_HandleRelease look at the handle to tell whether it's locked or not? Is there ever a time where a process wants a locked handle released, but not unlocked? I can't think of a situation in which it would, but I may be overlooking something.

1804.
Date: Mon, 05 Nov 90 10:57:06 +0100
From: Fred Douglis <douglis@cs.vu.nl>
Subject: Re: DS3100 load averages

this is something i've tried unsuccessfully to track down. i've noticed it for months. i initially suspected that the migd process was waking up at exactly the same moment that some other process was, but i put in random sleeps to try to desynchronize it and that failed. often, a machine that's totally idle (rebooted and no one has logged on) will show a load of 1.0 or 2.0. i've also noticed that starting a cpu-intensive process on one of these machines, and then terminating it, will usually cause the load to drop down to 0.
all i can think of is that there's some obscure code in migd that's causing it to get confused. that, or sched_Instrument.numReadyProcesses (or whatever it's called) is inaccurate. maybe i can take a peek at migd sometime to see -- this is a nasty bug that i'd like to stamp out.

1805.
Date: Mon, 5 Nov 90 09:49:12 PST
From: ouster (John Ousterhout)
Subject: Printing with verynew kernel

I'm having a terrible time printing with the "verynew" kernel. I get zillions of "Warning: receiver overrun on serialA" messages, and if I try to print anything of any size while doing anything else on my workstation (tyranny) it never finishes. I finally gave up and rebooted "new", at which point printing started working again (it still got a few overrun messages, but not very many and everything eventually printed OK). This makes me suspect that the verynew kernel is leaving interrupts off for a long time where it didn't used to. Could there be bugs in the net module that would be causing this to happen? Could this also be responsible for some of the sluggishness we've been seeing with Allspice?

1806.
Date: Mon, 5 Nov 90 13:10:15 PST
From: shirriff (Ken Shirriff)
Subject: Ethernet collisions

John and I are getting huge numbers of "LE ethernet: Too many collisions" on our DECstations, but none on our suns. I can't get any work done on violence because of this. Does anyone know why this is happening?

Log-Number: 30336
Subject: allspice reboot
Date: Mon, 05 Nov 90 17:28:59 PST
From: Mike Kupfer <kupfer>

Allspice apparently died around 13:30, of unknown causes. Somebody rebooted it around 15:15. It died with a level 15 interrupt while doing disk checks. I rebooted it around 16:15.

    mike

Log-Number: 30337
From: tve (Thorsten von Eicken)
Subject: Re: allspice reboot
Date: Mon, 05 Nov 90 17:39:43 PST

I rebooted it at 15:15. It had died with an "FScache_write: alloc failed .... DISK FULL".
    TvE

Log-Number: 30338
Date: Tue, 6 Nov 90 10:28:19 PST
From: ouster (John Ousterhout)
Subject: Printer problems

The problem that gives my printer fits is ~ouster/dist/tcl/Tcl.man. Try cd-ing to that directory and then typing "ditroff -man Tcl.man". Then do something else with the printing workstation while it prints, like compiling or reading mail. Let me know whether this works on your printers, OK?

    -John-

Log-Number: 30340
Date: Tue, 6 Nov 90 11:16:14 PST
From: mendel (Mendel Rosenblum)
Subject: Re: Printer problems

Works fine on the 477 evans OusterPrinter. I printed it while compiling programs, reading mail, and eating cookies with no problems. Remember when we were saying that our printer tossed jobs when it ran out of paper and you said it didn't happen to you? Maybe our printers are configured differently.

    Mendel

Log-Number: 30344
Subject: Re: Printer problems
Date: Tue, 06 Nov 90 12:54:55 PST
From: Mike Kupfer <kupfer>

Well, I tried it twice with the printer in 608-2 (driven by Sage). The first time it failed. I got around 15 "receiver overrun" messages (around 2-3 times usual) and nothing came out. The second time I got 6 "receiver overrun" messages and it printed fine. I was reading mail and netnews at the time.

    mike

Log-Number: 30341
Date: Tue, 6 Nov 90 11:23:09 PST
From: mendel (Mendel Rosenblum)
Subject: fscheck segfaults if HOST not set

fscheck segfaults if the environment var HOST is not set. When I login to allspice HOST is not set so fscheck segfaults.

    Mendel

Log-Number: 30342
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 6 Nov 1990 12:00:48 PST
Subject: Re: fscheck segfaults if HOST not set

I fixed this bug and installed a new fscheck.

    John

Log-Number: 30348
Date: Tue, 6 Nov 90 22:30:22 PST
From: shirriff (Ken Shirriff)
Subject: tx pdev bug?

A couple times I've had my tx window die with "PdevReply: extra reply data (6 > 0) </hosts/violence/tx4>" in my syslog and "ReplyWithData couldn't send pdev reply; status "address given by the user for a system call was bad" in my login window. Any ideas what this means?

Log-Number: 30352
Date: Wed, 7 Nov 90 14:01:43 PST
From: gibson (Garth Gibson)
Subject: gremlin

I'm trying to run gremlin on sprite and display on X11R4 on SunOS

1) man gremlin
   gives me the old AED512 pre window system tool

2) gremlin -display apathy:0
   aborts with
       usage: gremlin [-o] [-s <.gremlinrc>] [file] [generic tool arguments]

3) set DISPLAY = apathy:0
   gremlin
       Couldn't open display. Are you sure X is running?

4) xgremlin seems to work, but it isn't the same tool (do people use it)

Log-Number: 30353
Date: Wed, 7 Nov 90 14:03:35 PST
From: gibson (Garth Gibson)
Subject: grn/ditroff/psdit

I added a stipple pattern to a figure (on SunOS - see last message) and tried to print the figure. Fails with error:

    /sprite/lib/ps/sun4.md/psdit: bad input char \055 (-)

Log-Number: 30354
Date: Wed, 7 Nov 90 14:30:48 PST
From: shirriff (Ken Shirriff)
Subject: Proc_Migrate error

I'm doing mkmf's and I keep getting:

    Proc_Migrate: user does not have permission to migrate.

Why is this?

Log-Number: 30355
Date: Wed, 7 Nov 90 14:35:18 PST
From: mendel (Mendel Rosenblum)
Subject: Re: Proc_Migrate error

    jaywalk% rsh violence migcmd -s
                Import  Export  Version  Ignore
    Current:    root    root    16

It looks like migration is turned off on violence. Look at /hosts/violence/bootcmds.

    jaywalk% cat /hosts/violence/bootcmds
    #
    # Boot script for sage
    #
    source /boot/bootcmds
    #/user2/jhh/cmds.ds3100/lockd &
    # Allow only root to do process migration involving this machine.
    # (Violence isn't the most stable machine, with kernel development)
    migcmd -I root -E root

    Mendel

Log-Number: 30357
Date: Thu, 8 Nov 90 14:19:37 PST
From: mendel (Mendel Rosenblum)
Subject: _HAS_PROTOTYPES and stdio.h, stdlib.h don't work well

If a user program defines _HAS_PROTOTYPES and tries to include stdio.h and stdlib.h it gets errors. The problem with stdio.h is that its prototypes use varargs.h macros and types but it doesn't include varargs.h. The problem with stdlib.h is that its prototypes use types from sys/types.h such as "size_t" but it doesn't include sys/types.h.

    Mendel

Log-Number: 30359
Date: Thu, 8 Nov 90 16:54:04 PST
From: mendel (Mendel Rosenblum)
Subject: /sprite/cmds.sun4/mkmf overwritten

/sprite/cmds.sun4/mkmf has become some kind of x-based graph program for the sun4:

    ls -l mkmf
    -rwxrwxr-x 1 eklee 131072 Nov 8 14:53 mkmf*

I moved it to mkmf.eklee and reinstalled mkmf. Anyone know anything about this?

    Mendel

Log-Number: 30360
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Thu, 8 Nov 1990 17:01:30 PST
Subject: allspice crash report

Allspice crashed when /sprite/src/kernel filled up. Bob was copying a directory at the time. Both the new and old copies ended up in lost+found. The timer module is suspect, particularly the sun3.md and symm.md subdirectories. I think Ken, Mendel, and I have put it back together.

Fscheck had to work hard to get the disk back into a sensible state. At one point it was confused by garbage in an indirect block (there is code to prevent this but perhaps it doesn't work), and it didn't update the link counts correctly when it put directories in lost+found.

    John

Log-Number: 30362
Date: Thu, 8 Nov 90 22:52:13 PST
From: sullivan (Mark Sullivan)
Subject: lint doesn't understand ansi prototype decls

These statements are syntax errors as far as Sprite lint is concerned:

    extern double atof(char *);
    extern int atoi(char *);
    extern long atol(char *);
    extern void abort(void);

The compiler likes them just fine.
    Mark

Log-Number: 30363
From: rab (Robert A. Bruce)
Subject: Re: lint doesn't understand ansi prototype decls
Date: Fri, 09 Nov 90 08:59:54 PST

You need to include the file cfuncproto.h, and declare your prototypes like this:

    extern double atof _ARGS_((char *));
    extern int atoi _ARGS_((char *));
    extern long atol _ARGS_((char *));
    extern void abort _ARGS_((void));

Since these are all library functions, you can use the declarations in stdlib.h:

    #ifdef __STDC__
    #define _HAS_PROTOTYPES
    #endif
    #include <stdlib.h>

    -bob

Log-Number: 30364
Date: Fri, 9 Nov 90 13:49:50 PST
From: mgbaker (Mary Gray Baker)
Subject: recovery problem

Allspice glitched right when I was writing out an editor file. After allspice recovered, the write still did not complete. I finally had to interrupt the write and execute it again. Somehow the process wasn't being woken up by the recovery system.

    Mary

Log-Number: 30367
Date: Sat, 10 Nov 90 19:06:39 PST
From: mendel (Mendel Rosenblum)
Subject: Moving swap directory causes infinite recovery loop

If you move the swap directory of the machine to a different file server while the machine is running you can cause infinite recovery loops. The problem is that when you fork a process with swap pages on the original server it creates the swap file on the new file server and tries to do COPY_BLOCK RPCs to the old file server. Because this handle is for a different machine, the original file server doesn't find it and returns STALE_HANDLE, which causes the client machine to start recovery. After recovery, the client restarts the COPY_BLOCK and starts the loop over again.

The moral of this story is to avoid moving the swap directory of a running machine.

    Mendel

Log-Number: 30369
Subject: default TM after mkmf for shell script
Date: Mon, 12 Nov 90 16:36:09 PST
From: Mike Kupfer <kupfer>

If I do "mkmf" in a directory for a shell script (e.g., /sprite/src/cmds/ranlib), it sets TM to default to the machine type that mkmf ran on.
If I do "mkmf" in a directory for a program (e.g., /sprite/src/attcmds/tcsh) TM defaults to $(MACHINE). I assume this is a bug, probably in /sprite/lib/mkmf/mkmf.script. Can someone confirm or deny this for me?

thanks,
    mike

Log-Number: 30370
Date: Tue, 13 Nov 90 14:50:37 PST
From: mgbaker (Mary Gray Baker)
Subject: allspice crashes

Allspice crashed several times today. It locked up so that we could not get it into the debugger. Mendel and I tried debugging it via phone-link between machine room and cad lab, but we have no conclusive results. Also, fscheck got a scsi bus error, so it may be that some files are lost.

The l1d command on ginger seems to have problems. It always returns the error message "Address already in use."

    Mary

Log-Number: 30371
Subject: "make install" didn't save old version
Date: Tue, 13 Nov 90 15:56:35 PST
From: Mike Kupfer <kupfer>

I did a "make installall" in /sprite/src/cmds/ranlib to test out a hypothesis about a pmake-related bug. I expected to clobber the ds3100 ranlib, which would be okay because I'd just restore it from /sprite/cmds.ds3100.old. Well, when I did the "make installall" it didn't save the old ranlib in /sprite/cmds.ds3100.old, it just zapped it. Oops. (Bob, could you please restore /sprite/cmd.ds3100/ranlib from tape? Thanks.)

    mike

Log-Number: 30372
Subject: MACHINES and TM variables in Makefile
Date: Tue, 13 Nov 90 16:04:44 PST
From: Mike Kupfer <kupfer>

In pmake, just what exactly is the role of MACHINES? The comments in the foo.mk files say "list of all target machines currently available for this program". However, this list is apparently only used for things like "make installall". There's no check for whether TM is in MACHINES.

This leads me to the following bug: the MACHINES variable in a "script" Makefile is set to all the known machine types. (/sprite/lib/mkmf/mkmf.script is responsible for this.) This is not always correct behavior. For example, /sprite/src/cmds/ranlib should not be installed on a ds3100.

There is a sort-of related bug in the way TM is handled. The TM variable in a "script" Makefile defaults to whatever machine you ran mkmf on. So if I run mkmf on a sun3, then do a "make install" on a symm, the script gets installed in cmds.sun3 (when I really want it installed in cmds.symm).

It seems to me that scripts should be handled more like programs. That is, if MACHINES is only one machine type, TM should default to that type. If MACHINES is more than one type, TM should default to $MACHINE.

Comments, anyone?

    mike

Log-Number: 30373
Date: Tue, 13 Nov 90 17:07:31 PST
From: shirriff (Ken Shirriff)
Subject: ds5000 tftp is flaky

The tftp boot on the ds5000 takes over 4 minutes. The problem is that the tftp implementation on the ds5000 seems to be flaky. (To review tftp: after establishing a connection, the server sends a numbered 512 byte block and the client acknowledges receipt.)

On a normal machine we get:

    server sends 1.      client acks 1.
    server sends 2.      client acks 2.
    etc.

On the ds5000 we get:

    server sends 1.      client acks 1.
    server sends 2.      client acks 1. (i.e. it doesn't accept 2)
    server resends 2.    client acks 1.
    server resends 2.    client acks 2. (finally)
    server sends 3.      client acks 2.
    etc.

We end up sending each block 4 times before the ds5000 accepts it. This happens with tftp from ginger and from allspice, so it's not our implementation problem. (Although tftp from ginger is about twice as fast as from allspice.)

    Ken

Log-Number: 30375
Date: Wed, 14 Nov 90 00:23:33 PST
From: tve@ginger.Berkeley.EDU (Thorsten von Eicken)
Subject: ip server on anise died in the last half hour

Is there a "checkIPserver" entry in crontab for anise?

    TvE

Log-Number: 30376
From: rab (Robert A. Bruce)
Subject: anise out of memory
Date: Wed, 14 Nov 90 01:46:04 PST

Anise ran out of memory in Vm_RawAlloc.
-bob

Log-Number: 30379
From: mendel (Mendel Rosenblum)
Subject: Re: anise out of memory
Date: Wed, 14 Nov 90 10:36:38 PST

>Subject: anise out of memory
>Date: Wed, 14 Nov 90 01:46:04 PST
>
>Anise ran out of memory in Vm_RawAlloc.
>
> -bob

Let's cross our fingers and hope this doesn't happen again. It might have been related to some file cache size playing around I did after I rebooted anise. I meant to set the maximum file cache size very large and managed to set the minimum file cache size large. This could have caused anise to run out of memory if it needed to grow the kernel for file handles or something.

Mendel

Log-Number: 30377
Date: Wed, 14 Nov 90 09:00:09 PST
From: ouster (John Ousterhout)
Subject: /tmp not getting exported right

When I came in this morning I rebooted both tyranny and piracy to move their swap directories to /swap2. However, after I did this neither machine had access to /tmp, for example:

    tyranny: cd /tmp
    /tmp: no such file or directory

In order to get access to /tmp I had to type "prefix -h tyranny -x /tmp" on Anise, followed by "prefix -a /tmp -s anise" on Tyranny, and ditto for Piracy.

-John-

Log-Number: 30378
From: mendel (Mendel Rosenblum)
Subject: Re: /tmp not getting exported right
Date: Wed, 14 Nov 90 10:30:56 PST

The real problem here is that someone or something deleted the remote link /tmp. This meant that tyranny and any other machine rebooted would not find a /tmp.

A little known feature of prefixes was set into action next. The "prefix -h tyranny -x /tmp" had the effect of only allowing tyranny to import /tmp. Using the "-h" option on a prefix that was previously freely exported has the effect of creating an export list with only that host on it. This explains why piracy couldn't import /tmp.

We need to

a) Find who or what is deleting /tmp. This used to occur when we had /tmp as a remote link to oregano and the problem went away when /tmp became a directory.
b) Either change the prefix export list stuff or make sure everyone knows how it works.

Also, I tried to delete the prefix /tmp from anise in order to nuke the export restriction list. The command

    prefix -x /tmp

did nothing except produce the message:

    prefix coundn't delete prefix: there was an error

There was no additional info in the syslog. To patch the problem without rebooting anise, I explicitly exported /tmp to every host with a swap directory pointed to in /swap.

Mendel

Log-Number: 30380
Date: Wed, 14 Nov 90 13:51:23 PST
From: mendel (Mendel Rosenblum)
Subject: Kernel bloat

Here is a quick summary of memory usage on anise:

    Kernel Size        19.5 Megabytes    61%
    File Cache Size     7.9 Megabytes    25%
    User Mem Size       4.4 Megabytes    13%
    Other                .2 Megabytes     1%
    Total Mem          32.0 Megabytes   100%

So on a 32 megabyte machine you get around 8 megabytes of usable file cache.

Of the kernel mem for the file system state:

    Local File handles:                      5.6  Megabytes
    File Descriptors attached to handles:    2.3  Megabytes
    Buffers for LFS cleaning and writing:    1.25 Megabytes
    Remote File handles:                     1.0  Megabytes
    ClientInfo state:                        954  Kilobytes
    Hash table for handles:                  851  Kilobytes

So the file system state allocated for bookkeeping (handles, ClientInfo, hash table) is over 8.4 megabytes. Another way of looking at this is that the data describing the cached blocks is larger than the cached blocks.

Mendel

Log-Number: 30381
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Thu, 15 Nov 1990 15:45:37 PST
Subject: shadow stream offsets updated incorrectly

The offsets in shadow streams are not updated correctly when a client process does an lseek(). Lseeks must be propagated through to the server when processes that share a stream are on different hosts. The following illustrates the problem:

    Process A opens a file.
    Process A forks process B.
    Process B migrates to a different machine.
    Process A seeks to end of file.
    Process B reads from file. This should return 0 (end of file).
Instead process B will read from the start of the file. I was looking at the code in preparation for adding lseek RPCs for the SOSP paper and couldn't figure out how it worked. A little test program I wrote shows that it doesn't. John Log-Number: 30382 From: mendel (Mendel Rosenblum) Subject: Re: shadow stream offsets updated incorrectly Date: Thu, 15 Nov 90 16:41:22 PST There might be an easy fix for this. Since seeks() are done with IOControls in Sprite ,just pass the IOC_REPOSITION ioctl on to the server if the file is marked as non-cachable. The receiving stub routine is already setup to handle this IOControl. This is the same change as I did with the WRITE_BACK ioctl get the clients to force files all the way thru to disk. Mendel Log-Number: 30383 Date: Thu, 15 Nov 90 21:41:40 PST From: root (The Sprite God) Subject: anise crash At about 9:15pm, anise went into the debugger. Its screen was blank, so I couldn't see what was wrong. I also couldn't find a kernel to use for debugging. So I ended up continuing it, since there was a chance that it was some mousetrap Mendel had mentioned to Thorsten earlier today. It seemed to recover nicely, but I don't know if I messed something up by continuing it. I hope not. Mary Log-Number: 30384 Date: Fri, 16 Nov 90 11:46:35 PST From: tve (Thorsten von Eicken) Subject: X server on the sparc 1+ why is it still different? What do I have to do to get it started? When the server hangs on a sparc, L1-K gives me the keyboard back, but ctrl-C doesn't kill the server, or anything, so it's of no use. The only way out is to reboot. TvE Log-Number: 30385 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 16 Nov 1990 12:35:32 PST Subject: more shared stream problems The bug I reported yesterday is a subset of a much larger bug. When a process is migrated that is sharing a stream with another process, both streams on both clients need to be marked as remote-shared so that the offset is maintained on the server. 
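The offset sharing at stake in these reports can be seen on a single host with a small sketch like the following. It is an illustration only, not the test program mentioned above: it uses plain POSIX calls (open, fork, lseek, read), involves no migration, and the function name and scratch file path are made up. Two processes that share a stream through fork() share one access position, so after the parent seeks to end of file the child's read should return 0; that is the behavior that has to survive when the sharers end up on different clients.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Sprite code): the parent opens a file, forks,
 * seeks to end of file, and then lets the child read.  Returns the number
 * of bytes the child's read() returned, or -1 if anything went wrong.
 */
int
child_read_after_parent_seek(const char *path)
{
    char buf[16];
    char go = 'g';
    int syncPipe[2];
    int status;
    int released;
    int fd;
    pid_t pid;

    assert(path != NULL);

    /* Process A opens a file and puts five bytes in it. */
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }
    if (write(fd, "hello", 5) != 5 || lseek(fd, 0, SEEK_SET) != 0 ||
            pipe(syncPipe) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    /* Process A forks process B; both now share one stream. */
    pid = fork();
    if (pid < 0) {
        close(syncPipe[0]);
        close(syncPipe[1]);
        close(fd);
        unlink(path);
        return -1;
    }
    if (pid == 0) {
        ssize_t n;

        /* Process B waits until A has finished its seek, then reads. */
        close(syncPipe[1]);
        if (read(syncPipe[0], &go, 1) != 1) {
            _exit(255);
        }
        n = read(fd, buf, sizeof(buf));
        _exit(n < 0 ? 255 : (int) n);
    }

    /* Process A seeks to end of file, then releases B through the pipe. */
    close(syncPipe[0]);
    released = (lseek(fd, 0, SEEK_END) == 5 &&
            write(syncPipe[1], &go, 1) == 1);
    close(syncPipe[1]);
    if (waitpid(pid, &status, 0) != pid) {
        released = 0;
    }
    close(fd);
    unlink(path);

    if (!released || !WIFEXITED(status) || WEXITSTATUS(status) == 255) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

With both processes on one machine, child_read_after_parent_seek("/tmp/shared_offset_demo.tmp") should return 0. The reports in this thread say that the equivalent sequence reads from the start of the file once process B has migrated.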
I don't see any support for this in the code, nor does Brent's dissertation talk about it. I'm not sure if the migrated stream is marked correctly since I haven't looked at the code, but I do know that the non-migrated stream is not marked. For this to happen the server must do a callback as part of the migration. The bottom line is that shared streams to cacheable files do not work if the streams are on different clients. John Log-Number: 30387 Date: Fri, 16 Nov 90 13:13:27 PST From: ouster (John Ousterhout) Subject: Re: more shared stream problems Perhaps I'm missing something, but something doesn't sound right about John's last message. Here's his message: The bug I reported yesterday is a subset of a much larger bug. When a process is migrated that is sharing a stream with another process, both streams on both clients need to be marked as remote-shared so that the offset is maintained on the server. I don't see any support for this in the code, nor does Brent's dissertation talk about it. I'm not sure if the migrated stream is marked correctly since I haven't looked at the code, but I do know that the non-migrated stream is not marked. For this to happen the server must do a callback as part of the migration. The bottom line is that shared streams to cacheable files do not work if the streams are on different clients. I'm not sure exactly what scenario is being referred to here, but let's consider a couple of different cases: 1. If there are two streams for the same file on two different clients, there's no need to worry about the second stream when the first one migrates (or when one of several processes sharing the first stream migrates). If the two streams are independent (from different opens) then there's no problem in the first place; if they are actually handles for a single shared stream, then they should have each been marked "shared" a long time ago, when one of them migrated away from the other. 2. 
If the scenario is two streams for the same file on the same client, then by definition these are independent streams (separate access positions, etc.), so there's nothing to worry about, right? The only time action is needed is if a stream is shared on one client and one of the sharers migrates away; at this point both handles need to get marked as shared. Are you saying that in this situation the "handle left behind" doesn't get marked? The point where this should occur, I think, is when the file server calls back to the source host during migration to fetch the access position for the file. -John- Log-Number: 30389 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 16 Nov 1990 13:39:50 PST Subject: Re: more shared stream problems Sorry that my original message wasn't very clear. John O. hit the nail on the head in his final paragraph. Imagine a stream that is shared on one client. When one of the sharers migrates away both handles (actually I think the object are referred to as "streams", hence some of the confusion) must be marked as shared. This currently does not happen. This can trivially be proven by grepping for FS_RMT_SHARED in the kernel sources. The only place in which this bit is set in a stream's flags is in Fsio_StreamMigClient() which is called on the IO server when a stream migrates. For this reason I don't think the shared stream is marked appropriately on either client. I know by experimentation that it is not set in the "handle left behind". John Log-Number: 30386 Date: Fri, 16 Nov 90 12:54:33 PST From: shirriff (Ken Shirriff) Subject: Assault crash Assault was totally dead and was repeating "TI: 7" on the screen. Ken Log-Number: 30388 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 16 Nov 1990 13:00:28 PST Subject: migration bug I foolishly tried to have a process migrate itself. On a sun3 the Proc_Migrate() call fails with an "system call interrupted by signal" error. This sort of makes sense I guess. 
On a ds3100 the machine goes into the debugger when it tries to insert a duplicate entry in the TLB when it sets up a COW segment. Clearly this is incorrect. If a process isn't allowed to migrate itself then Proc_Migrate() should catch this early and print out a meaningful error message. John Log-Number: 30393 From: Fred Douglis <douglis@cs.vu.nl> Subject: Re: migration bug Date: Mon, 19 Nov 90 09:40:49 +0100 a process is supposed to be able to migrate itself. however, this is not something that's tested in day-to-day use, and may never have been tested on the ds3100. in fact, it's possible it wasn't tested after the change long ago to treat migration as a signal. in any case, yes, this is definitely a bug. i'd recommend deferring action on it until the migration code is rewritten to deal with resuming system calls. (in the best of all possible worlds, of course :) Fred Log-Number: 30390 Date: Sat, 17 Nov 90 17:36:05 PST From: dingle (Adam T. Dingle) Subject: can't unarchive large file I have an archive (cl.tar) which contains a large file which I wish to extract. Unfortunately, I get some sort of file error when I attempt to extract it: % tar xvf cl.tar ./build/files.bu -rw-r--r-- 2628/0 6824389 Mar 30 12:10 1990 ./build/files.bu tar: Tried to write 4096 bytes to file, could only write -1: ./build/files.bu: operation would block At this point the tar process continues to run for awhile, but the extracted file ends up being 1061376 bytes long, so presumably it is being truncated somewhere. There is plenty of disk space on the partition where I am extracting the file. The file "cl.tar" is in /pcs/cl/allegro.new. Any suggestions? -Adam Log-Number: 30391 From: mendel (Mendel Rosenblum) Subject: Re: can't unarchive large file Date: Sat, 17 Nov 90 17:46:11 PST >I have an archive (cl.tar) which contains a large file which I wish to >extract. 
Unfortunately, I get some sort of file error when I attempt >to extract it: > >% tar xvf cl.tar ./build/files.bu >-rw-r--r-- 2628/0 6824389 Mar 30 12:10 1990 ./build/files.bu >tar: Tried to write 4096 bytes to file, could only write -1: >./build/files.bu: operation would block > >At this point the tar process continues to run for awhile, but the >extracted file ends up being 1061376 bytes long, so presumably it is >being truncated somewhere. > >There is plenty of disk space on the partition where I am extracting >the file. The file "cl.tar" is in /pcs/cl/allegro.new. > >Any suggestions? > >-Adam This is caused by the bug I reported a while ago, Sprite log message 30223: >From mendel Thu Oct 18 10:24:33 1990 Received: by sprite.Berkeley.EDU (5.59/1.29) id AA922165; Thu, 18 Oct 90 10:24:33 PDT Date: Thu, 18 Oct 90 10:24:33 PDT >From: mendel (Mendel Rosenblum) Message-Id: <9010181724.AA922165@sprite.Berkeley.EDU> To: bugs Subject: tar/sprite fs incompatiblity For some reason tar opens files being created during extraction with the options: O_NDELAY|O_WRONLY|O_APPEND|O_CREAT|O_EXCL. The O_NDELAY causes sprite to set the stream as NON_BLOCKING. Unfortunately, non-blocking streams to regular files work differently in Sprite than Unix. In Unix, writes to non-blocking regular files behave the same as writes to blocking regular files. In Sprite, writes to non-blocking files return EWOULDBLOCK if the file cache is full. This error causes the file not to be written. I think the fix is to do the same thing we did for reads of files that block because of cache full. Mendel Log-Number: 30394 Date: Mon, 19 Nov 90 12:07:06 PST From: mendel (Mendel Rosenblum) Subject: LFS problems I was able to patch /pcs this morning and it is back online. No data was lost. I'm going to leave /swap2 down until I get back. /user5 is ok. It has a few minor problems that shouldn't cause any crashes. The good news is that all the problems we are having with LFS are from the same bug. 
The bad news is I haven't found the bug yet. The problem is that while laying data out in a segment it is recording incorrect disk addresses in the index that points at the blocks. I haven't figured out the sequence of events that causes it to happen. It appears to get back on track for the next log write. If you are lucky, you overwrite all the data with bad index pointers before you need to read them from disk again. I suspect that this might be some kind of glitch that occurs between shutdown and attach.

Until I figure out this problem we should probably avoid doing lots of stuff in the LFS partitions.

Sorry,
Mendel

1842.
Date: Tue, 20 Nov 90 12:27:05 PST
From: eklee (Edward K. Lee)
Subject: possible tx bug

Moving a tx window around using the geometry command produces a messed-up vi window when vi is subsequently invoked. Making the tx window smaller and then larger again with the geometry command seems to solve the problem. For example:

    geometry =80x32+0-20
    geometry =80x32-0-20
    vi
    <messed up vi window>
    geometry =80x23+0+0
    geometry =80x32-0-20
    vi
    <vi window ok>

1843.
Date: Wed, 21 Nov 90 19:17:54 PST
From: Mike Kupfer <kupfer>
Subject: RCS file for spritehosts corrupted

sage% rlog spritehosts

RCS file:        RCS/spritehosts,v;   Working file:    spritehosts
head:            1.81
branch:
locks:           ;  strict
access list:
symbolic names:
comment leader:  "# "
total revisions: 81;    selected revisions: 81
description:
database of sprite machines
rlog error: Missing line number in edit script
rlog aborted

Should we patch it by hand or restore it from tape?

1844.
Date: Mon, 26 Nov 90 09:37:39 PST
From: ouster@dill (John Ousterhout)
Subject: Allspice crash

When I came in this morning Allspice was in the debugger with a page fault in the kernel, pc = 0x0, address = 0x0. I rebooted it, but it got the same error again as soon as it got into recovery.
When I explored with the debugger, it turned out that Allspice was calling location 0 through a dispatch table, at line 999 of fsioStream.c, in Fsio_StreamReopen. The reason for this was that reopenParamsPtr pointed to the following:

    streamID:
        type:     0
        serverID: 14
        major:    10
        minor:    39536
    ioFileID:
        type:     -1
        serverID: 0
        major:    0
        minor:    0

The -1 value of reopenParamsPtr->ioFileID.type led to the branch to zero. The machine causing the problem was coons (id 82); to allow Allspice to reboot, I L1-A'ed coons.

I'm not sure what's going on here, but at the very least it seems like the server should check for a valid type before dispatching, no?

-John-

1845.
Date: Tue, 27 Nov 90 11:12:49 PST
From: ouster (John Ousterhout)
Subject: Changed List_ stuff back again

In trying to make a distributable version of Tk, I discovered that the List_ procedures don't compile without sys.h being available, and that you recently added the "#include <sys.h>" lines (as part of the prototyping?). I've removed these #include statements and added explicit panic declarations by hand. The main reason for this is that the file <sys.h> doesn't work in general: you have to have the *kernel's* sys.h. User programs will pick up /usr/include/sys.h, which is different (sigh) and won't work. Also, I didn't want to have to distribute all sorts of extra include files along with the List module. Putting in the explicit declarations goes against our coding style, but I couldn't think of anything else any cleaner (declare a separate panic.h with only one declaration? Add the panic declaration to sprite.h?)

1846.
Date: Tue, 27 Nov 90 13:08:30 PST
From: Mike Kupfer <kupfer>
Subject: Re: Changed List_ stuff back again

Yes, that must have been done as part of the prototyping. I thought I rebuilt the entire C library after making those changes. I wonder why it seemed to work then but is now causing you problems.
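For anyone trying to picture the "explicit declaration by hand" approach from the previous message, a rough sketch might look like this. Everything in it is an assumption made for illustration: the printf-style prototype for panic(), the DemoLinks type, and the DemoList_ routines are simplified stand-ins and not the real List_ code, and the panic() defined at the bottom merely prints and aborts so that the sketch links on its own.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Explicit declaration written out by hand in the .c file, standing in
 * for "#include <sys.h>".  The printf-style prototype is an assumption.
 */
extern void panic(const char *format, ...);

/* Simplified stand-in for a list link; not the real List_ definition. */
typedef struct DemoLinks {
    struct DemoLinks *prevPtr;
    struct DemoLinks *nextPtr;
} DemoLinks;

/* Make the header an empty circular list that points at itself. */
void
DemoList_Init(DemoLinks *headerPtr)
{
    assert(headerPtr != NULL);
    headerPtr->nextPtr = headerPtr;
    headerPtr->prevPtr = headerPtr;
}

/* Insert itemPtr right after destPtr, calling panic() on bad arguments. */
void
DemoList_Insert(DemoLinks *itemPtr, DemoLinks *destPtr)
{
    if (itemPtr == NULL || destPtr == NULL) {
        panic("DemoList_Insert: itemPtr (%p) or destPtr (%p) is NULL.\n",
                (void *) itemPtr, (void *) destPtr);
        return;
    }
    itemPtr->nextPtr = destPtr->nextPtr;
    itemPtr->prevPtr = destPtr;
    destPtr->nextPtr->prevPtr = itemPtr;
    destPtr->nextPtr = itemPtr;
}

/*
 * Stand-in panic() for user-level builds: print the message and abort.
 * A distribution would have to supply some routine like this, since
 * panic() is not part of the standard C library.
 */
void
panic(const char *format, ...)
{
    va_list args;

    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    abort();
}
```

The alternative is to move the single extern line into a header and have the .c file include that header; the code that calls panic() would stay the same either way.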
I'm not particularly bothered by putting an explicit panic() declaration in the .c files, though it might be a good idea to add a comment saying why we violated the Sprite coding guidelines.

Having said that, though, I think the panic() declaration should go in a header file. panic() is not a standard C library routine, so we're going to have to provide it in the Tk distribution. As long as we're providing the routine itself, we might as well provide a header file that declares it. So, the question is, which one? If we're providing sprite.h in the distribution, that's one candidate, though it currently defines only typedefs and constants. Are there other general header files that we're already planning to include in the distribution?

1847.
Date: Tue, 27 Nov 90 15:50:58 PST
From: mendel (Mendel Rosenblum)
Subject: Re: X can't start with ginger down

>Return-Path: shirriff
>Received: by sprite.Berkeley.EDU (5.59/1.29)
> id AA272946; Tue, 27 Nov 90 15:39:54 PST
>Date: Tue, 27 Nov 90 15:39:54 PST
>From: shirriff (Ken Shirriff)
>Message-Id: <9011272339.AA272946@sprite.Berkeley.EDU>
>To: bugs
>Subject: X can't start with ginger down
>While ginger was down, xinit would wedge up. Ginger came back up before
>I could find why it was wedging. This happened on a sun4c and a decstation.
>Presumably xinit was accessing something mounted on ginger, but why?
>
>Ken

The problem is that ginger is the primary internet domain nameserver for sprite and the X server does many name lookups when it is started. Each name lookup times out on ginger before moving on to the backup name server (arpa). This causes X start up to take a very long time (30 minutes). By the way, Ginger didn't come back. I switched the primary and backup name servers so X now starts much faster.

1848.
Date: Tue, 27 Nov 90 23:08:47 PST
From: shirriff (Ken Shirriff)
Subject: Assault crashed

Assault crashed because it couldn't reinitialize the Lance chip.

1849.
Date: Wed, 28 Nov 90 12:35:42 PST From: mendel (Mendel Rosenblum) Subject: Booting problem fixed The problem with Sprite not booting was because /dev/console's descriptor got changed to point at a different device. The caused the open of /dev/console in initsprite to fail causing initsprite to do a Sys_TestPrintf() system call and exit. The Sys_TestPrintf() system call prints "Obsolete system call" and returns failure. I had to set a breakpoint in the routine printing "Obsolete system call" to figure this out. I renamed /dev/console to /dev/console.bad and created a correct /dev/console. It looks like: jaywalk% stat console D-rw-r--r-- 1 ID=(0,1) 0 bytes console Server Domain File # Device: Server Type Unit 14 10 21545 1 23 5 Version 2 UserType 0x0 Created: Jul 3 22:20:58 1990 Data modified: Jul 3 22:20:30 1990 Descr. modified: Nov 28 11:38:02 1990 Last accessed: Jul 3 22:14:58 1990 1850. Date: Wed, 28 Nov 1990 13:54:20 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: Reinit recv unit The "Reinit recv unit" errors that are prevalent on allspice are because the receive unit of the ethernet chip is out of resources and is discarding packets. The correct action to take in this case is to reinitialize the unit. I've seen similar behavior on a ds5000, which leads me to believe that we are turning off interrupts for an excessively long time somewhere in the kernel. Solutions are to avoid turning off interrupts for so long, and/or to increase the number of receive buffers. I don't know the feasibility of the latter. 1851. Date: Wed, 28 Nov 90 14:39:41 PST From: ouster (John Ousterhout) Subject: Re: Reinit recv unit I think that the new kernel is leaving interrupts off longer than the old ones did. I don't know why this is happening, but I suspect the changes that John made to the net module. 
My troubles printing are one example: it continues to be nearly impossible for me to print anything interesting while doing anything else interesting on my workstation, but only with the new kernel. The Reinit recv unit problems are another example. John, can you give some thought to how we might track down the source of the long interrupts-off intervals? For example, would it be possible to use Ken's technique of shortening the timer interrupt interval to identify the problem spots? 1852. Date: Wed, 28 Nov 90 17:20:18 PST From: shirriff (Ken Shirriff) Subject: Net for sun4's In order to run mop on the sun4, I'll need to change the net module to accept the broadcast packets. What drivers do the sun4 and sun4c's use? 1853. Date: Wed, 28 Nov 90 17:42:38 PST From: mendel (Mendel Rosenblum) Subject: Allspice glitch this morning The cause of Allspice's glitch this morning remains a mystery. The problem appeared to be a network-wide deadlock centering around the file /etc/spritehosts. (For some reason, allspice was reporting the errors on the file ",RCSt1925522" <10,90589> which is the inumber of /etc/spritehosts). When I pulled the network interface out of allspice it gave many consist timeouts of the form: <consist> 11/28/90 10:43:26 parsley (20) RPC timed-out ClientCommand, return-attrs msg to client 20 file ",RCSt1925522" <10,90589> failed 30002 Client state killed: 1 refs 0 write 0 exec <consist> 11/28/90 10:43:32 loiter (83) RPC timed-out ClientCommand, return-attrs msg to client 83 file ",RCSt1925522" <10,90589> failed 30002 Client state killed: 1 refs 0 write 0 exec and it became usable again. Reconnecting the network interface caused it to hang up again. I put sage into the debugger because it appeared to be hanging consist rpcs and everything cleared up. Debuggering sage didn't turn up anything except it was in the progress of reopening handles with allspice. Maybe there is a deadlock involving recovery and consist callbacks. 
It is also possible that it would have cleared itself up even if I hadn't killed sage.

1854.
Date: Thu, 29 Nov 90 16:20:03 PST
From: ouster (John Ousterhout)
Subject: /tmp on Assault

I just noticed that /tmp is now on /user2 on Assault. Is there a particular reason why it's on Assault instead of Allspice? Putting it on Assault means that virtually no-one will be able to get work done when Assault is down; it used to be that Assault didn't affect very many people.

1855.
Date: Thu, 29 Nov 90 16:20:02 PST
From: shirriff (Ken Shirriff)
Subject: assault crash

Assault wasn't responding to pings, but it seemed okay from the console. It wedged up when I tried to ping allspice from it. When I tried to debug it, it went into an infinite loop of address errors on load in Net_RawOutput and wouldn't go into the debugger.

1856.
Date: Thu, 29 Nov 90 16:26:29 PST
From: Mike Kupfer <kupfer>
Subject: Re: /tmp on Assault

Didn't we put it on Anise for awhile? I don't recall why we put it on Assault instead of moving it back to Allspice, though, when we had LFS problems last week.

1857.
Date: Thu, 29 Nov 90 17:21:55 PST
From: shirriff (Ken Shirriff)
Subject: Random pmake problem

I've had a couple of random errors:

    as1: Error: , line 0: Obsolete or corrupt binasm file: /

1858.
Date: Fri, 30 Nov 90 09:42:06 PST
From: ouster (John Ousterhout)
Subject: Allspice crash

Allspice crashed again today, with the same poison-packet problem I reported last week. I believe that this makes 3 crashes from this problem in the last week. I think that we should do something about this *now*, before the problem starts happening so frequently that we can't even keep Allspice up long enough to compile a new kernel. I think that two things need to be done:

1. Modify the server code so that it detects a -1 ioStream.type in reopen calls. When this happens, I'd suggest that the server print a message (so that the offending client can be identified) and return a re-open error.

2.
Modify the client code to panic when a -1 type is about to be sent off during a reopen. This way we should be able to find the cause and fix it. Can someone take care of doing this ASAP? Above all, I think we need to get #1 done and installed for Allspice so that the system can survive the poison packets. 1859. Date: Fri, 30 Nov 90 16:25:54 PST From: mendel (Mendel Rosenblum) Subject: make account script doesn't like sun4s The script that creates accounts makes the directory cmds.sun3, cmds.ind, and cmds.ds3100 in the new account but not cmds.sun4. 1860. Date: Fri, 30 Nov 90 16:55:11 PST From: rab (Robert A. Bruce) Subject: Re: make account script doesn't like sun4s The adduser script creates the user's directory by copying ~newuser to the user's directory. ~newuser didn't have a cmds.sun4 directory, so the target directory didn't get it either. If new user directories are set up wrong, change the prototype directory in /user1/newuser. 1861. Date: Sat, 1 Dec 90 19:04:57 PST From: pmchen (Peter M. Chen) Subject: assault is dead It pings and fingers ok, but doesn't respond as a file server. 1862. Date: Sat, 1 Dec 90 19:12:50 PST From: gibson (Garth Gibson) Subject: remote mounting of SunOS disks It appears that when I leave a shell for a long time with current directory on ginger the connection is lost (nfsmount daemon dies?). When I return, "ls" reports no contents (it does not cause a new nfsmount to be created?). However, when I "cd" to the dirtectory I'm currently in, the connection is re-established (a new daemon is created?). 
espionage 62> dirs
~/unix/home/Thesis/Arrays/Text
espionage 63> ls
espionage 64> ls
espionage 65> cd ~/unix/home/Thesis/Arrays/Text
espionage 66> ls
total 411
   1 1_2_d.grn@                1 growth.me@
   1 2d_hamming_eg.grn@       20 growth.n
   1 ASPLOS@                  16 hamming.grn
   1 ASPLOS_dir/               1 hdr.me
   1 CMG@                      1 incharray.grn@
   1 CMG_dir/                 23 incorrect.hamming.grn
   3 ECC.problems              1 interleave.grn@
   3 Int.talk/                 1 monte_carlo.grn@
   1 Makefile@                 1 nonbinaryhamming.awk
   1 Nd_ega.grn@               1 outline@
   1 SCCS/                     1 raid.perf.grn@
   7 bib.me                    1 rel.tbl
  13 binary.me                 3 reliability2.grn
   1 check_matrix1.grn@        1 rotate.grn@
   7 check_sets.grn            1 stack.grn@
   1 chen_a.grn@               4 tandem.grn
   1 chen_b.grn@              78 text.me
   1 codes-2a.grn@             6 text.me.old
   1 fracs/                  200 text.n
   1 gallager                  1 trlr.me
   1 growth.grn@

1863.
Date: Sat, 01 Dec 90 19:26:04 PST
From: mendel (Mendel Rosenblum)
Subject: Re: assault

For some reason the nfsmount for /home/ginger/raid was nowhere to be found. I restarted it and access to the files on /home/ginger/raid appears to be restored.

1864.
Date: Sat, 01 Dec 90 23:53:41 PST
From: Mike Kupfer <kupfer>
Subject: making arrays of RPC counts

From rpc/rpcCall.h:

/*
 * RPC_LAST_COMMAND is used to declare the rpc procedure switch
 * and arrays of counters for each rpc.
 */

This is a bit inconvenient, though, because the rpc numbers range from 0 (RPC_BAD_COMMAND) to RPC_LAST_COMMAND inclusive, so you usually want to define your array to be foo[RPC_LAST_COMMAND+1]. Would anyone object if I added a line

#define RPC_NUM_OF_COMMANDS (RPC_LAST_COMMAND+1)

1865.
Date: Sun, 2 Dec 1990 14:23:58 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: freopen

Freopen on stdout only appears to work if you've already printed something to stdout. Otherwise subsequent output never shows up. I glanced through the code for freopen but couldn't see anything obviously wrong.

1866.
Date: Sun, 02 Dec 90 17:25:49 PST
From: Mike Kupfer <kupfer>
Subject: no check for decreasing RPC ID?

I was looking at RpcServerDispatch() and noticed something odd.
There is a check to see if a packet's RPC ID is different than the current ID (the one that the server expects). If it is, the comments and the code say that this means the packet is for a new RPC. However, the check is for unequal values, not for an increasing value. Does this mean the server is (naively?) assuming that it won't get old packets with discarded RPC IDs, or am I looking at the wrong code? 1867. Date: Sun, 02 Dec 90 19:10:15 PST From: Mike Kupfer <kupfer> Subject: bogus date after copying file from NFS sage% pwd /home/ginger/sprite/users/kupfer sage% ls -l mkfs -rwxr-xr-x 1 kupfer 32768 Aug 29 1989 mkfs* sage% cp mkfs ~/foo sage% ls -l ~/foo -rwxr-xr-x 1 kupfer 32768 Dec 31 1969 /users/kupfer/foo* sage% alias cp cp -ip sage% alias ls ls -F Copying in the reverse direction (Sprite to Ginger) doesn't show this problem. Log-Number: 30435 From: mendel (Mendel Rosenblum) Subject: Re: assault Date: Sat, 01 Dec 90 19:26:04 PST For some reason the nfsmount for /home/ginger/raid was nowhere to be found. I restarted it and access to the files on /home/ginger/raid appears to be restored. Mendel Log-Number: 30436 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 4 Dec 1990 10:50:38 PST Subject: newtee Newtee seems to copy /dev/syslog into both an output file and stdout. This means the output goes to /dev/console, which means you can't run X on the machine very well. I had to kill the newtee on assault so I could use the console. John Log-Number: 30442 Date: Tue, 4 Dec 90 14:26:22 PST From: elm (ethan miller) Subject: reappearing mail Sometime between last night and just after the first assault crash today (Tuesday 4 Dec about 1PM), about 5 messages were re-delivered to my mailbox. It's no big deal for me (I just deleted them), but it might be a symptom of a problem elsewhere. ethan Log-Number: 30444 Date: Tue, 4 Dec 90 15:59:09 PST From: pmchen (Peter M. Chen) Subject: finger doesn't work I am not able to run finger. 
I get an "Illegal instruction" error, and my syslog prints out Bogus bp-trap This is on garlic (ds3100). Pete Log-Number: 30445 Date: Tue, 4 Dec 90 16:00:09 PST From: pmchen (Peter M. Chen) Subject: finger problem just cleared up Dunno what happened, but now there appears to be no problem. I'm still curious as to what happened, though, so if anyone has any clues... Pete Log-Number: 30446 Date: Tue, 4 Dec 90 16:59:43 PST From: shirriff (Ken Shirriff) Subject: Pmake problem If I type "pmake TM=foo", in a kernel directory, I get: ld -r $(sh: syntax error at line 2: `(' unexpected Wouldn't a more intuitive error message be appropriate? Log-Number: 30447 Date: Tue, 4 Dec 90 18:28:33 PST From: bsw!adam@uunet.UU.NET (Adam de Boor) Subject: Re: Pmake problem the message: ld -r $(sh: syntax error at line 2: `(' unexpected comes from the shell when there's a variable being used that's not defined in pmake. if you say "ld -r $(LDFLAGS)" and LDFLAGS isn't defined, that's exactly what the shell will see. This is one of the ways that pmake differs from make. If you run "pmake -n TM=foo", you should be able to see what variable isn't being defined. a Log-Number: 30449 Date: Wed, 5 Dec 90 10:33:20 PST From: mendel (Mendel Rosenblum) Subject: Allspice close to death Allspice is very close to running out of RPC servers because of the number of servers deadlocked. The problem is that block 78 of the file /sprite/cmds.sun4/cc1.68k is marked as having IO_IN_PROGRESS yet I can't find any process doing IO on the block. A RPC from tyranny is trying to read the block and is stuck waiting for the IO to finish. This RPC is waiting with the monitor lock in the cacheInfo struct held which causes most other RPCs on the file such as opens and stats to wait. Currently, jaywalk, sedition, sage, boing, tyranny, treason, burble, sedition, sabotage, crackle, terrorism, sassafras, larceny, joyride, and espionage all have hung RPCs to allspice. 
Fortunately, we have more RPC servers on allspice than we have sparcStations. I could find no message in allspice's 4meg syslog file pertaining to this file or block. The only way out that I can think of is to reboot allspice. By the way, until we reboot, try to avoid compiling for the sun3 on a sun4 or doing ls commands in /sprite/cmds.sun4.

Mendel

Log-Number: 30453
Date: Wed, 5 Dec 90 16:02:06 PST
From: shirriff (Ken Shirriff)
Subject: Assault crash

Assault crashed again when it tried to reinitialize the LE chip and failed.

Log-Number: 30454
From: mendel (Mendel Rosenblum)
Subject: Allspice deadlock
Date: Wed, 05 Dec 90 16:20:56 PST

Allspice hung up with one of its patented consistency deadlocks. I pulled the network interface out and it limped thru recovery and came back to life.

Mendel

Log-Number: 30457
Date: Thu, 6 Dec 90 16:39:56 PST
From: ouster (John Ousterhout)
Subject: FYI

>From karels@okeeffe.Berkeley.EDU Wed Dec 5 22:22:59 1990
>From: karels@okeeffe.Berkeley.EDU (Mike Karels)
To: ouster@sprite.Berkeley.EDU, eric@mammoth.Berkeley.EDU
Cc: culler@sprite.Berkeley.EDU, bks@okeeffe.Berkeley.EDU
Subject: continuing network problems between mammoth and sprite
Date: Wed, 05 Dec 90 17:55:56 PST

Once again, we found mammoth sending about 300 packets/sec. to a Sprite workstation when emacs got hosed, causing rather ragged network response for the rest of the CS division. In this case, the culprit seemed to be cardamom, where David Culler was logged in. David, do you know what happened to your emacs window?

Last time I complained about this (in August), John said that someone would be replacing the Sprite IP/TCP code within the next few months; has any progress been made? Has anyone looked at emacs to find out why it goes crazy?

If no one does anything to fix this problem, I'll bring the issue up with the network committee.
According to the EECS network policy, miscreant hosts can be disconnected from the network until there is reason to believe that the problems have been fixed.

Mike

Log-Number: 30464
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Sat, 8 Dec 1990 16:17:04 PST
Subject: cc warnings

I encountered the following warnings while compiling the kernel for the sun4c:

sun4c.md/uword.c: In function read_iureg:
sun4c.md/uword.c:80: warning: assignment between incompatible pointer types
sun4c.md/uword.c:84: warning: assignment between incompatible pointer types
../Include/timer.h:234: warning: data definition lacks type or storage class
devNet.c: In function DevNet_FsOpen:
devNet.c:155: warning: `maxSize' may be used uninitialized in this function
sun4c.md/devSCSIC90.c:618: warning: unused variable `ctrlPtr'
sun4c.md/devSCSIC90.c: At top level:
sun4c.md/devSCSIC90.c:851: warning: `PrintRegs' defined but not used
fsconsistCache.c: In function Fsconsist_NumClients:
fsconsistCache.c:1306: warning: return-type defaults to `int'
sun4c.md/vmSun.c: In function VmMach_Init:
sun4c.md/vmSun.c:685: warning: unused variable `segTablePtr'
sun4c.md/vmSun.c: In function VmMach_NetMapPacket:
sun4c.md/vmSun.c:2388: warning: unused variable `pageNum'
sun4c.md/vmSun.c:2387: warning: unused variable `segNum'
sun4c.md/vmSun.c: In function VmMach_DMAAllocContiguous:
sun4c.md/vmSun.c:4222: warning: unused variable `initialized'
sun4c.md/vmSun.c:4215: warning: `beginAddr' may be used uninitialized in this function
sun4c.md/vmSun.c: In function VmMach_DMAFree:
sun4c.md/vmSun.c:4361: warning: value computed is not used
sun4c.md/vmSun.c: At top level:
sun4c.md/vmSun.c:4870: warning: `VmMachTrap' defined but not used
sun4c.md/vmSun.c:2561: warning: `FlushWholeCache' defined but not used

Log-Number: 30465
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Sat, 8 Dec 1990 16:50:56 PST
Subject: sun4 cc warnings

I got the following warnings compiling for the sun4:

sun4.md/uword.c: In function read_iureg:
sun4.md/uword.c:80: warning: assignment between incompatible pointer types
sun4.md/uword.c:84: warning: assignment between incompatible pointer types
--- sun4.md/devJaguarHBA.o ---
sun4.md/devJaguarHBA.c:324: warning: `GetJaguarMem' defined but not used
devNet.c: In function DevNet_FsOpen:
devNet.c:155: warning: `maxSize' may be used uninitialized in this function
fsconsistCache.c: In function Fsconsist_NumClients:
fsconsistCache.c:1306: warning: return-type defaults to `int'
sun4.md/vmSun.c: In function VmMach_NetMapPacket:
sun4.md/vmSun.c:2437: warning: value computed is not used
sun4.md/vmSun.c: In function VmMach_DMAAllocContiguous:
sun4.md/vmSun.c:4222: warning: unused variable `initialized'
sun4.md/vmSun.c:4215: warning: `beginAddr' may be used uninitialized in this function
sun4.md/vmSun.c: In function VmMach_DMAFree:
sun4.md/vmSun.c:4361: warning: value computed is not used
sun4.md/vmSun.c: In function VmMach_32BitDMAAlloc:
sun4.md/vmSun.c:5117: warning: value computed is not used
sun4.md/vmSun.c: In function VmMach_32BitDMAFree:
sun4.md/vmSun.c:5170: warning: value computed is not used
sun4.md/vmSun.c: At top level:
sun4.md/vmSun.c:4870: warning: `VmMachTrap' defined but not used
sun4.md/vmSun.c:2561: warning: `FlushWholeCache' defined but not used

Log-Number: 30466
Date: Sat, 8 Dec 90 17:06:46 PST
From: pmchen@ginger.Berkeley.EDU (Peter M. Chen)
Subject: garlic death

>From eklee@sprite.Berkeley.EDU Fri Dec 7 20:30:46 1990
>Subject: HW problems with garlic.
>
>Crashed a few minutes ago.
>When I press the reset button garlic displays:
>7..6..5..4..3..2..1..
>FAILURE
>
>Power cycling garlic results in the same behavior.
>
>Has anyone experienced this problem before?
>
>Ed
>

When I try to boot using tftp, it fails with a short read error:

    tftp()new: short read
    couldn't load tftp()new

When I try to boot with mop, it gets farther, but then prints:

    SII: wait on CMD_PHASE failed
    Dev_SIIIntr: Bus reset!!
    Warning: SII# Target 0 LUN 0 reset and current command terminated.

Horrible hardware death to another decstation or insidious assassination on a spice? You decide...

Terry, I switched garlic with mustard (which most recently was in the CAD lab). So the broken machine is in the CAD lab. When the ds5000's come on line, I'll keep the name mustard here.

Pete

Log-Number: 30467
Date: Sat, 8 Dec 90 21:23:41 PST
From: pmchen (Peter M. Chen)
Subject: mail to allspice is down

I can't mail from apathy (sunOS) to allspice. Apathy thinks allspice is down.

Pete

Log-Number: 30468
Date: Sun, 9 Dec 90 10:23:21 PST
From: ouster (John Ousterhout)
Subject: Migd global log cancerous?

When I came in this morning the root disk was full, so I poked around to see what was causing the trouble. Among other things, /sprite/admin/migd/global-log was over 20 Mbytes. Does anyone know (a) if this file needs to be kept at all, (b) if not, how to stop migd from writing it, and (c) if so, how to at least truncate it?

I also noticed some other things:

1. /tmp.old had about 20 Mbytes in it. I just deleted the whole directory (it appears to predate the first use of LFS for /tmp).

2. /sprite/admin/dump/restore had about 10 Mbytes, apparently from an old restoration. I deleted the restore directory subtree.

3. There was a 20-Mbyte file /dev/rxb1.nr, created about 8:00 this morning. I assumed that this file represents some sort of error, so I deleted it.

4. /sprite/boot/ds3100.md had over 20 Mbytes of kernels in it. I deleted the following ones:

    ds3100.KS.243, shirriff2, sync.new, sync (all owned by Ken, and all older than June 1)
    fred (created by fred in early September)

Can everyone check this directory for old kernels and delete them?
I make it a practice not to copy kernels into this directory, but just to leave symbolic links from there to my kernel working directory. I think this practice makes it easier to keep track of disk space usage, and it reduces the likelihood of leaving clutter in /sprite/boot.

-John-

Log-Number: 30470
Subject: Re: Migd global log cancerous?
Date: Sun, 09 Dec 90 12:05:06 PST
From: Mike Kupfer <kupfer>

/sprite/admin/migd/global-log is the log from the global migration daemon. I think it should be kept around, because that's where error messages go.

There are a couple things I can think of to make the file smaller. One is to reduce the logging/debugging level that migd is invoked with (currently 2). Another is to put something in, say, allspice's bootcmds like

    mv /sprite/admin/migd/global-log /sprite/admin/migd/global-log.old

mike

Log-Number: 30477
From: Fred Douglis <douglis@cs.vu.nl>
Subject: Re: Migd global log cancerous?
Date: Mon, 10 Dec 90 10:24:22 +0100

Moving the global log when allspice reboots is tricky, since the migration daemon might already have the old one open. of course, usually the act of rebooting allspice causes a new migration daemon to pop up, because the version numbers of the files change.

You should be able to change the debugging level from 2 to 0 without ill effects.

Fred

Log-Number: 30471
Date: Sun, 9 Dec 90 12:05:07 PST
From: ouster (John Ousterhout)
Subject: LFS performance uneven?

When I switched my Tk development directory over to /user5 this morning it felt like compiles were running slower than they used to, so I ran some tests of complete recompiles of Tk/Wish for the Sun-3, both in the LFS directory on /user5 and in my home directory on /user1. In both cases /tmp was on an LFS disk.
Here are the results from several runs:

    OFS:    182.3u 50.5s 1:23 278%
            184.1u 49.8s 1:17 301%
    LFS:    184.4u 53.1s 2:10 182%
            181.8u 49.5s 1:17 299%
            184.5u 52.5s 1:56 203%
            183.7u 49.7s 1:11 327%

It appears from these numbers that LFS performance is inconsistent, varying by as much as one minute (almost 50%). Could it be that cleaning is kicking in during the slow runs, and that this is the source of the inconsistency?

-John-

Log-Number: 30473
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Sun, 9 Dec 1990 16:01:29 PST
Subject: /dev/rxb1.nr

The creation of the 20 MB file "/dev/rxb1.nr" that John O. reported is due to a bug in the dump program. Dump was supposed to be opening "/dev/exb1.nr", as I've named the exabyte connected to allspice. Instead it opened up a different file and dumped to it, thereby filling the root partition. At the very least the file should not be opened with the create flag set.

John

Log-Number: 30474
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Sun, 9 Dec 1990 16:03:53 PST
Subject: retraction

I'd like to retract my last bug report concerning /dev/rxb1.nr. My typing was at fault, not dump. I still think dump shouldn't create the file, however.

John

Log-Number: 30479
From: mendel (Mendel Rosenblum)
Subject: Re: The story about anise/burble
Date: Mon, 10 Dec 90 10:25:47 PST

> Return-Path: mgbaker
> Received: by sprite.Berkeley.EDU (5.59/1.29)
> id AA79210; Sun, 9 Dec 90 17:24:48 PST
> Message-Id: <9012100124.AA79210@sprite.Berkeley.EDU>
> To: bugs
> Subject: The story about anise/burble
> Date: Sun, 09 Dec 90 17:24:45 PST
> From: Mary Baker <mgbaker>
>
> Burble was trying to fork an "sh -ev". It was in the midst of trying to
> do a Vm_SegmentDup and copy a page for the segment from its swap space
> (VmCopySwapSpace). The src swapFilePtr for the src segment had a server ID
> of anise, while the destination swapFilePtr had a server ID of allspice.
> The link /swap/56 points to allspice.
> In Fsrmt_BlockCopy (called by
> Fs_PageCopy) the Rpc_Call to do the RPC_FS_COPY_BLOCK command uses
> the serverID in the src swapFilePtr. The server (anise in this case) then
> executes Fsrmt_RpcBlockCopy for a file which is actually on allspice. This
> routine does a FsrmtFileVerify on the handle, which returns NIL.
> This ends up causing STALE_HANDLE to be returned from the Rpc_Call, which
> causes Fs_PageCopy to decide the server is down and it waits for recovery
> and retries the copy forever. It seems to me that there are a number of
> problems here, including the confusion of serverIDs and the infinite retrying
> of the access of a swap file which doesn't exist. We've moved the swap
> directories back and forth between allspice and anise a couple of times now.
> Maybe that's not supposed to happen often, but perhaps we should try to get this
> to work correctly anyway since it wouldn't be difficult. It's funny that this
> is just happening now to burble, since its swap directory was moved to allspice
> from anise quite a while ago.
>
>
> Mary

We should make sure that tve or someone else didn't try to move the swap directory. This problem, reported in Sprite log entry #30367, has happened every time I've tried to move a swap directory of a sun4 while it was running. Sometimes it took a long time to happen. An old shell with a single page swapped out will cause the problem on the next command you type at it.

Another possibility that occurred to me is an interaction with migration. Was the forking process migrated there? Was the swap file being copied in "56" or some other swap directory? What happens if a shell migrates from a machine with a swap directory on anise to a machine with a swap directory on allspice and then tries to fork? The swap file for the shell will reside on anise but the newly created data and stack segments for the shell will reside on allspice.

1) This would happen on sun4s because it's only possible if copy-on-write is turned off.
2) It also wouldn't happen frequently because we infrequently migrate processes with vm segments active. Most migration uses remote exec, which should not (does not?) create swap files.

3) Should be 100% reproducible with a simple test program that migrates and forks() to a swap with a different swap server.

Mendel

Log-Number: 30476
Subject: arson debugger loop
Date: Sun, 09 Dec 90 21:09:05 PST
From: Mike Kupfer <kupfer>

I found arson in a loop, continuously printing the following two lines on the console. I couldn't put it into the debugger, and unfortunately I don't know what kernel it was running.

    MachKernelExceptionHandler: Address error on load: addr: 17 PC: 800a22b0
    Entering debugger with a TLB load address error exception at PC 0x800a22b0

mike

Log-Number: 30478
Date: Mon, 10 Dec 90 09:06:51 PST
From: ouster (John Ousterhout)
Subject: Allspice crash

Allspice died late last night with the same old poison-packet problem. This time the culprit was chisum, a ds3100 over in Cory. I put chisum in the debugger while rebooting Allspice.

-John-

Log-Number: 30480
From: mendel (Mendel Rosenblum)
Subject: Larceny dies with floating point trap error
Date: Mon, 10 Dec 90 11:13:15 PST

Larceny died because of a problem with the interactions between the low level debugging and floating point support of the sun4. Someone using Michael E. Hohmeyer's account was debugging a program that did naughty floating point operations causing IEEE traps. The user set a breakpoint after the bad operation had started and before its result was used (i.e., between the "fadds" instruction and the "stf" instruction). This meant the kernel was entered with a debugger trap and with a floating point trap waiting to happen. The problem is that the kernel handles the debugger trap and then returns to the user without checking for and handling the floating point trap. This causes the floating point unit to get upset and report a sequence error, causing Sprite to panic.
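
The sequence described above can be modeled with a small sketch. This is only a toy simulation under stated assumptions, not the Sprite sun4 trap code: the TrapState structure and every function name below are hypothetical, and the "fixed" path simply shows one plausible remedy (checking for a pending floating point trap before returning to user mode).

```c
#include <assert.h>

/* Hypothetical model of per-process trap state; not Sprite's real structures. */
typedef struct {
    int fpTrapPending;   /* FPU has an exception waiting to be taken */
    int fpTrapsHandled;  /* number of FP traps the kernel has serviced */
    int sequenceErrors;  /* FP instruction issued while a trap was still pending */
} TrapState;

/* Breakpoint bookkeeping only; deliberately ignores the FPU. */
static void HandleDebuggerTrap(TrapState *ts)
{
    (void) ts;
}

/* Service the pending floating point trap. */
static void HandleFpTrap(TrapState *ts)
{
    ts->fpTrapPending = 0;
    ts->fpTrapsHandled++;
}

/* The user's next instruction is the FP store; it faults with a
 * sequence error if a trap is still pending. */
static void ReturnToUser(TrapState *ts)
{
    if (ts->fpTrapPending) {
        ts->sequenceErrors++;
    }
}

/* Behavior as reported: debugger trap handled, FP trap never checked. */
static void TrapBuggy(TrapState *ts)
{
    HandleDebuggerTrap(ts);
    ReturnToUser(ts);
}

/* Assumed remedy: drain the pending FP trap before returning. */
static void TrapFixed(TrapState *ts)
{
    HandleDebuggerTrap(ts);
    if (ts->fpTrapPending) {
        HandleFpTrap(ts);
    }
    ReturnToUser(ts);
}
```

In this model the buggy path records a sequence error whenever a breakpoint lands between the faulting operation and the use of its result, while the fixed path does not.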
Until this problem gets fixed I would advise:

1) Avoid debugging floating point programs on the sun4.
or
2) Avoid naughty floating point operations. Using doubles rather than singles may help.

Mendel

ps, The routine causing the error was named:
"calculate_tangent__FP6vertexPP17surface_embeddingPP7surface"
Maybe the original Unix linker's limit on routine names was a good idea.

Log-Number: 30481
Subject: problems with migd on roar
Date: Mon, 10 Dec 90 11:41:45 PST
From: Mike Kupfer <kupfer>

The migd running on roar thought it should be the master, even though there seemed to be a valid master running on joyride. The net result was that you couldn't do anything using the migd on roar ("uptime", "make", etc.). I tried killing the migd and restarting it; that didn't help. I ended up zapping /sprite/admin/migd/pdev, and everyone seems to be happy now.

mike

Log-Number: 30482
From: mendel (Mendel Rosenblum)
Subject: Bug in Mx tag lookup
Date: Mon, 10 Dec 90 16:00:30 PST

In /sprite/src/kernel/proc when I type "mx -t Proc_GetPCB" I get a notifier that looks like:

    |------------------------|
    |      ____________      |
    |      | Continue |      |
    |      ------------      |
    |------------------------|
    | bad pattern "^#define  |
    | Proc_GetPCB(pid)       |
    | (proc_PCBTable[pid &   |
    | PRO": Premature end of |
    | regular expression     |
    --------------------------

Mendel

Log-Number: 30483
Subject: another unprintable document?
Date: Mon, 10 Dec 90 16:05:54 PST
From: Mike Kupfer <kupfer>

Can someone print /sprite/doc/pmake/tutorial.t? The invocation is supposed to be "lpr -h -n tutorial.t". The Laserwriter in 608-2 seems to be doing something (lots of flashing lights), but eventually lpq says "Not responding for 1 minutes." and eventually the job goes away. No paper ever appears.

mike

Log-Number: 30484
Date: Mon, 10 Dec 90 18:09:19 PST
From: sethg (Seth C. Goldstein)
Subject: I can't print from roar

I get the message:

    ginger.Berkeley.EDU: /usr/lib/lpd: Your host does not have line printer access

Log-Number: 30492
Subject: Re: I can't print from roar
Date: Tue, 11 Dec 90 13:10:26 PST
From: Mike Kupfer <kupfer>

I added roar to hosts.equiv on Ginger. Somebody apparently forgot to do this when they set up roar. (See step 7 of /sprite/admin/howto/addNewHost).

mike

Log-Number: 30485
Date: Mon, 10 Dec 90 18:13:20 PST
From: shirriff (Ken Shirriff)
Subject: Shutdown doesn't sync disks!

As I suspected, shutdown does not successfully flush the cache. I verified this on a sun3 running new and on a ds3100.

I believe the problem is the following code:

    CacheWriteBack(writeBackTime, blocksSkippedPtr, writeTmpFiles)
        ...
        /*
         * Wait for all block cleaners to go idea before returning.
         */
        while ((numBackendsActive > 0) && !sys_ShuttingDown) {
            (void) Sync_Wait(&writeBackComplete, FALSE);
        }
    }

I don't understand the cryptic comment, but this code waits until all the writebacks are complete. But if we're shutting down, it skips the wait! Mendel, do you know what this code should do, so I can fix it?

This bug could explain some of the mail file trashing we've encountered, as well as various annoying things that happen to me after reboots.

Ken Shirriff
shirriff@sprite.Berkeley.EDU

Log-Number: 30486
Date: Mon, 10 Dec 90 20:28:49 PST
From: gibson (Garth Gibson)
Subject: Clove meltdown

Ann's machine (clove I think) is going into the debugger twice a second right now. It seems to throw the line "ICMP Echo" at the bottom of the screen after each entry to the debugger, then it overwrites this message with the next entering debugger message:

    MachKernel Exception Handler: Address error on load: addr: 17 PC: 8009bc90
    Entering debugger with a TLB load address error .....

garth

ps I left it doing this

Log-Number: 30487
Date: Mon, 10 Dec 90 22:26:13 PST
From: pmchen (Peter M. Chen)
Subject: X server dies

When running viewlogic on giverny (in Cory) to Evans, my X server goes into the DEBUG state. It opens the window fine, but dies after the first mouse event.

Pete

Log-Number: 30488
Date: Mon, 10 Dec 90 22:32:53 PST
From: pmchen (Peter M. Chen)
Subject: followup on X server dying

The application (viewlogic) was running remotely on a sparcstation (giverny) in Cory. The local (displaying) machine was mustard (a ds3100). When the same thing was tried with the displaying machine being espionage (sparcstation), it worked.

Pete

Log-Number: 30491
Subject: Re: followup on X server dying
Date: Tue, 11 Dec 90 11:48:11 PST
From: Mike Kupfer <kupfer>

This is probably the byte-swapping death that Mario Silva was experiencing a couple months ago. I know where to make the server fix, I just haven't gotten around to it yet. I'll bump it up in my Todo list.

By the way, if it is the same bug, the client is not totally blameless. It is generating a bogus length value--see ~kupfer/x_byteswap_bug.

mike

Log-Number: 30489
Subject: migration problem?
Date: Tue, 11 Dec 90 01:36:24 PST
From: Mary Baker <mgbaker>

Has anyone else received the following error while trying to compile stuff? This was on a sun4c.

    Warning: SigMigSend:Error trying to signal 11 to process 1355b (2492c on host 73): the specified process's user ID does not match the current process's uid

Mary

Log-Number: 30493
Subject: Rpc daemon timeout queue entry/reclaim servers bug
Date: Tue, 11 Dec 90 18:57:24 PST
From: Mary Baker <mgbaker>

There's good and bad news. I've stuck in a fix to keep the rpc daemon timeout queue entry from being rescheduled when it hasn't been descheduled. John Hartman's debug tracing found this bug for us. But there is nothing in the code to keep RpcReclaimServers from being called more often than it should be, not giving clients enough of a chance to use their channels to the rpc servers.
This is already possible in our system and doesn't appear to be a problem, but we should keep an eye on it and also put in a fix at some point. The first problem was more critical, though.

Mary

Log-Number: 30494
Subject: stale handle on swap file
Date: Tue, 11 Dec 90 19:02:20 PST
From: Mary Baker <mgbaker>

I put in a fix so that getting a stale handle on a swap file won't put you into infinite recovery - it will instead return a swap error. My question is whether we want to do the same thing in Fs_PageRead. Here too, if it gets a stale handle it goes into retrying recovery. Why would we want to do that on a stale handle there? Am I misunderstanding the meaning of a stale handle?

Mary

Log-Number: 30497
From: mendel (Mendel Rosenblum)
Subject: Re: stale handle on swap file
Date: Wed, 12 Dec 90 10:34:12 PST

>
>
> I put in a fix so that getting a stale handle on a swap file won't put you
> into infinite recovery - it will instead return a swap error. My question is
> whether we want to do the same thing in Fs_PageRead. Here too, if it
> gets a stale handle it goes into retrying recovery. Why would we want to
> do that on a stale handle there? Am I misunderstanding the meaning of a
> stale handle?
>
> Mary

I believe that the problem here is not going thru recovery when a stale handle error is returned, but returning a stale handle error when recovery won't repair the problem. The BlockCopy RPC could return stale handle and have recovery fix everything. For example, consider the case where the block copy is the first RPC following a network partition that the server detected but the client didn't. The BlockCopy would return stale handle because the server had cleaned up the client state. Recovery would re-establish the state and the BlockCopy would be retried successfully. The correct thing is to start recovery when the server returns stale handle.
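
The difference between the two stale-handle cases can be shown with a toy model. This is a minimal sketch under stated assumptions, not the Sprite Fs/Fsrmt code: the status codes, FakeServer, and all function names are hypothetical, and the bounded retry stands in for the "return a swap error instead of recovering forever" behavior described earlier in the thread.

```c
#include <assert.h>

/* Hypothetical status codes; Sprite's real ReturnStatus values differ. */
typedef enum { STATUS_OK, STATUS_STALE_HANDLE, STATUS_SWAP_ERROR } Status;

/* Toy model of the server: does it still hold this client's state, and
 * does the file actually live on it? */
typedef struct {
    int hasClientState;
    int ownsFile;
} FakeServer;

/* Server side: any handle it cannot verify is reported as stale. */
static Status ServerBlockCopy(const FakeServer *srv)
{
    if (!srv->ownsFile || !srv->hasClientState) {
        return STATUS_STALE_HANDLE;
    }
    return STATUS_OK;
}

/* Recovery re-establishes client state; it cannot move a file. */
static void Recover(FakeServer *srv)
{
    srv->hasClientState = 1;
}

/* Client side: recover and retry on a stale handle, but give up with a
 * swap error after maxRetries instead of looping forever. */
static Status ClientBlockCopy(FakeServer *srv, int maxRetries)
{
    Status status = ServerBlockCopy(srv);
    int tries;

    for (tries = 0; status == STATUS_STALE_HANDLE && tries < maxRetries; tries++) {
        Recover(srv);
        status = ServerBlockCopy(srv);
    }
    return (status == STATUS_STALE_HANDLE) ? STATUS_SWAP_ERROR : status;
}
```

With a server that merely lost the client's state (the partition case), one round of recovery makes the retry succeed. With a server that does not hold the file at all, recovery never helps, and only the retry bound keeps the client from spinning.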
The problem here was that the server side of the block copy RPC should return illegal argument (src and/or dst not local files) rather than stale handle in this case. By convention in Sprite, the server-side stubs blindly trust anything passed to them, so it would probably be in keeping with tradition if Fsrmt_BlockCopy was modified not to do the RPC if the src and dst aren't on the same machine.

As currently modified, the code in Fsrmt_BlockCopy() does a Fsutil_WantRecovery() yet its caller in Fs_PageCopy() no longer does a Fsutil_WaitForRecovery(). I don't understand this Want/Wait stuff well enough to know if this will cause problems.

It occurs to me that the easiest way to fix this problem is to modify Fs_PageCopy to call Fs_PageRead and Fs_PageWrite if the src and dst aren't on the same machine. This would fix the problem without introducing some three-machine RPC that a higher performance solution would.

Mendel

Log-Number: 30496
From: Fred Douglis <douglis@cs.vu.nl>
Subject: more distribution bugs: booting clients
Date: Wed, 12 Dec 90 14:23:15 +0100

wonder of wonders, i finally got a second sun 3/60 to run sprite. i finally booted it, but only after several problems:

1) the addhost program relies on spritehosts being RCS'ed, yet there's no RCS'ed version. at least, not here. perhaps it should be more robust about the existence of one. same for hosts.equiv and various others. addhost checks to see if the file is checked out but not to see if it's RCSed in the first place.

2) the distribution instructions, addhost, and howto/addNewHost, are all very berkeley-specific. they talk about ginger, etc. they don't talk about what to do to boot a diskless client from the sprite root file server.

3) the support for diskless clients on the distribution was totally nonexistent. i had to ftp stuff from berkeley.

   - /sprite/boot contained a directory, "TM.md," with a file called "sun3" that was 8 megabytes long. not likely to be a real kernel.
     i moved TM.md to sun3.md and copied /vmsprite to sun3.md/sprite. i copied netBoot from berkeley.

   - i had to dredge up varied arcane knowledge about booting a diskless workstation and about booting sprite. saying
         b le()
     didn't work. saying
         b le(,42)sun3.md/sprite
     did. note for whoever might revise the instructions: at berkeley, we use the last two octets of the internet address (e.g., 961b). here i had to use only the last octet.

4) addhost had a couple of more major problems. first, it didn't convert the internet address into upper case for /sprite/boot. i presumed it would if needed. second, it made /sprite/boot/c01fe78 instead of c01fe708.

- Fred

Log-Number: 30498
Subject: ds3100: "page number offset out of page table"
Date: Wed, 12 Dec 90 12:22:23 PST
From: Mike Kupfer <kupfer>

------- Forwarded Message

Date: Wed, 12 Dec 90 02:42:43 PST
>From: sethg (Seth Copen Goldstein)
To: root@sprite.Berkeley.EDU
Subject: roar crashed at 3am with following:

Page number offset out of page table
sprite version 1.075 (ds3100) 11 sep 90
debug at address 0x800c39cc

also, F1-A did nothing, had to reset machine.
what did I do wrong?

------- End of Forwarded Message

Log-Number: 30499
Subject: ds3100: reserved instruction
Date: Wed, 12 Dec 90 12:23:06 PST
From: Mike Kupfer <kupfer>

------- Forwarded Message

Date: Wed, 12 Dec 90 11:29:11 PST
>From: sethg (Seth Copen Goldstein)
To: root@sprite.Berkeley.EDU
Subject: Twice in three hours: system crashed with reserved instruction

4:00 am
MachKernelExceptionHandler: Resererved Instruction
Entered debugger With a reserved instruction exception at pc=0x8ea80570

Do you want this info? What should I do during daylight hours? Help - I am trying to run a 40hour long simulation and the crashes are killing me!

------- End of Forwarded Message

Log-Number: 30500
From: mendel (Mendel Rosenblum)
Subject: Anise crash
Date: Wed, 12 Dec 90 15:23:07 PST

Anise deadlocked today so I took a core dump and rebooted it.
My best guess as to what happened was that the timer callback queue wasn't being processed. This meant that everything that depended on timer callbacks quit working. I have no idea what caused this problem.

Mendel

Log-Number: 30502
From: mendel (Mendel Rosenblum)
Subject: /hosts/{assault allspice anise}/bootcmds doesn't exit
Date: Wed, 12 Dec 90 17:01:09 PST

I was adding the command to /hosts/anise/bootcmds to redirect the syslog into a file, so I looked at how it was done on allspice and assault. It was done by adding the command:

    newtee -inputFile /dev/syslog /sprite/syslogs/$HOST

to the end of /hosts/<hostname>/bootcmds. Since newtee doesn't exit, this means that the bootcmds script never exits. Was this done for a reason? Does it cause any problems for the boot script not to exit?

Mendel

Log-Number: 30503
From: mendel (Mendel Rosenblum)
Subject: gdb.new inserts ^P over rlogin connections
Date: Wed, 12 Dec 90 17:56:54 PST

When I use gdb.new while rlogin'ed into a Sprite machine I get "^P" characters inserted at random places. The problem doesn't occur when I use telnet. I suspect it is a problem in "rlogin" or "rlogind". The new gdb appears to use many more, and more exotic, ioctls on stdin than the old one.

Also, if you type control-Z in gdb.new you get the message "ioctl: bad command TIOCSTART". It appears to work correctly.

Mendel

Log-Number: 30504
Subject: chown of dev on murder failed
Date: Wed, 12 Dec 90 18:34:46 PST
From: Mike Kupfer <kupfer>

I tried to make root the owner of one of the Exabyte devices on murder.

    sage-1# chown root /hosts/murder/dev/exabyte

The chown apparently succeeded, but it provoked some other problems. I got the following in my syslog:

    <setIOAttr> 12/12/90 18:26:09 murder (17) RPC timed-out
    Fsrmt_SetIOAttr failed <30002>: device <5,80> at server 17

and there are seven messages

    RpcResend: RPC 23, client 33, RPC seq # 1fda9, forgot reply?

(all the same) on murder's console.

mike

Log-Number: 30505
Subject: Re: chown of dev on murder failed
Date: Wed, 12 Dec 90 18:36:56 PST
From: Mike Kupfer <kupfer>

(Oops, I wrote the subject line before I did an "ls" to see whether the chown really worked.)

/mike

Log-Number: 30506
Subject: TCP problem found, not sure about fix
Date: Wed, 12 Dec 90 20:41:32 PST
From: Mike Kupfer <kupfer>

I found our end of the TCP problem. The TCP input code is supposed to check whether the socket is still in use when it receives data. If not, it's supposed to send a RESET to the other end. Our function Sock_HasUsers checks the reference count on the socket data structure. The bug is that Sock_Close doesn't decrement the reference count. The code to do it is there, but it's commented out.

I suppose I could just uncomment that line of code, but I'm nervous about the potential side effects. This is the RCS log line for the relevant changes:

    revision 1.17
    date: 89/08/10 16:15:59; author: douglis; state: Exp; lines added/del: 63/77
    JKO changes for ipserver duplicate frees, etc.

So, John or Fred: is the line commented out because it was thought to be a no-op?

mike

Log-Number: 30507
Date: Thu, 13 Dec 90 09:15:09 PST
From: ouster (John Ousterhout)
Subject: Re: TCP problem found, not sure about fix

Thanks for tracking this down, Mike! I took a look at the code to try to remember why the following line is commented out:

    /* sharePtr->clientCount--; */

First of all, the rlog comment "JKO changes for ipserver duplicate frees, etc." refers to a nasty problem we had with our ipServers for a long time, whereby they used memory that had been freed, causing storage corruption and crashes. If you think our ipServer reliability is bad now, you should have seen it in the old days. Anyhow, I found the bug and fixed it in sockOps.c version 1.17. However, I don't think that the bug explains the line you found commented out.
The actual bug fix is some new code at about line 811 of sockOps.c:

    /*
     * Transfer the pdev request buffer from the old socket
     * to the new one, so it can be freed properly when the
     * new socket is deleted.
     */
    connSockPtr->reqBufSize = privPtr->sharePtr->reqBufSize;
    connSockPtr->requestBuf = privPtr->sharePtr->requestBuf;

I only vaguely remember the problem, but I believe it had to do with the wrong pdev buffer being freed at certain times.

It doesn't seem like me to make a bug fix by commenting out a line of code without any additional comments to explain why. In addition, I also see that version 1.16 does not have the line at all. In other words, the file went from having nothing there to having a line commented out. There was never a version of the file that had the code in uncommented form. I suspect that what happened is I noticed that the reference count was never getting decremented, added the line, discovered that the ipServer didn't work correctly any more, commented the line out, saw that the ipServer worked again, and forgot to remove the commented line before checking the file in.

The right thing to do is probably to un-comment the line and see what breaks. I suspect that something will break, but I don't remember what.

-John-

Log-Number: 30508
Date: Thu, 13 Dec 90 12:07:17 PST
From: sethg (Seth Copen Goldstein)
Subject: arson and pride crashes (maybe they were me?)

I came in this morning to find that arson and pride had crashed. I had been running my simulations on them, so I might be the culprit. If you think that it is so then I will be happy to provide you with my simulation so that you can see where it happened. No rush, I found other machines to do the simulation on.

seth

p.s. I had also miged out a process to somewhere else which died, but I can't remember which machine.

Log-Number: 30509
Date: Thu, 13 Dec 90 13:29:39 PST
From: pmchen (Peter M. Chen)
Subject: arson and pride crashes

I have simulations which are also prone to generating horrible crashes when they run for a long time. This is only on ds3100's. We've never tracked down the problem, though I have a script which will repeat the crash every time, at the exact same place.

After the machine crashes, the screen goes blank, and the lights at the back of the machine are all on. The machine does not respond to F1-A, and I think we had to either power cycle it or hit the reset button. Maybe even the reset button didn't work. I forget. Does this sound like yours?

Anyway, Mendel had a program which would crash both under sprite and ultrix. We suspected that it might be a hardware problem, but we never verified by running my program on dill (an ultrix ds3100). Maybe your problem is the same--try running it on dill.

Pete

Log-Number: 30510
Date: Thu, 13 Dec 90 13:32:44 PST
From: sethg (Seth Copen Goldstein)
Subject: arson and pride crashes

The simulation runs fine under ultrix. It seems that these also crash at the same place (the log files end up being the same size). The screen does not always go blank (2 out of three times it didn't). However, F1-A does not work and reset is required.

seth

Log-Number: 30514
Date: Thu, 13 Dec 90 15:09:23 PST
From: gibson (Garth Gibson)
Subject: why is raid1 so slow ?

raid1 64> ps -au | m
USER   PID   %CPU %MEM SIZE  RSS STATE   TIME PR COMMAND
root   74d11 28.6  0.1  128  112 READY  87:02    /sprite/daemons/cron
root   34d13  1.9  0.4  560  480 RWAIT 158:46    /sprite/daemons/ipServer
gibson 74d24  0.9  0.2  256  248 WAIT    0:06    -csh
root   e4d38  0.8  0.1  256  152 READY   0:00    more
gibson e4d37  0.7  0.1  240  160 RUN     0:00    ps -au
root   74d22  0.6  0.1  168  168 RWAIT   0:08    rlogind
root   84d0f  0.0  0.1  216  152 RWAIT   0:31    /sprite/daemons/migd -D ...
root   34d1d  0.0  0.1  152  136 RWAIT   0:07    /sprite/daemons/inetd ...
root   74d16  0.0  0.0  224    0 RWAIT   0:03    /sprite/daemons/lpd
root   84d23  0.0  0.1  168  104 WAIT    0:01    login -h ...
root   64d15  0.0  0.1  320  144 RWAIT   0:02    sendmail -bd
root   d4d2f  0.0  ---  ---  --- READY   0:00    /sprite/daemons/cron
root   24d28  0.0  0.1  168  104 RWAIT   0:02    /sprite/cmds.$MACHINE/lo...
root   14d0b  0.0  ---  ---  --- EXIT    0:00    cmds/initsprite -b ...

raid1 65> more /hosts/raid1/crontab
#5 8,11,14,17,20 * * * root /c/stats/RAW
# at 3 Garth's ginger to raid rdist runs
0 4 * * * eklee /users/eklee/bin/chksum/at.script
0 5 * * * eklee /users/eklee/bin/paritycheckraid

raid1 66> uptime
raid1 sun4 up 3+16:54 inuse 2.85 2.95 2.83 (3+16:54)

It seems to take 10-20 secs to do ls on a sprite disk.

garth

Log-Number: 30516
Subject: Re: why is raid1 so slow ?
Date: Thu, 13 Dec 90 15:13:28 PST
From: Mary Baker <mgbaker>

Right now raid1 is going through an infinite recovery loop with allspice. I will see why and try to fix it. I may have to reboot but will tell you so if I do.

Mary

Log-Number: 30517
Date: Thu, 13 Dec 90 15:17:35 PST
From: pmchen (Peter M. Chen)
Subject: re-sent mail

Is anyone else getting mail resent from several days ago? This just happened about 3 times in the past 5 minutes.

Pete

Log-Number: 30520
From: mendel (Mendel Rosenblum)
Subject: Fs_Select broken in new kernel
Date: Thu, 13 Dec 90 15:53:50 PST

The Fs_Select system call in the new kernel returns 0 rather than FS_TIMEOUT when the timeout value is exceeded. This causes a panic() in the C library routine file socket.c when a connect or accept request times out. This causes most network programs (rsh, rlogin, telnet) to enter the debugger when the specified host is down. For example:

jaywalk% sysstat -v
jaywalk SPRITE VERSION 1.079 (sun4c) (11 Dec 90 13:58:16)
jaywalk% rsh lust
Wait (socket.c): Fs_Select returned 0 ready
Debug
jaywalk%

Mendel

Log-Number: 30531
Subject: Re: more stuck mail on allspice (whining)
Date: Sat, 15 Dec 90 16:14:27 PST
From: Mike Kupfer <kupfer>

> You send the message, and sendmail delivers it to everyone but the
> down machine.
Well, what should happen (I confirmed this with Keith Bostic) is that sendmail leaves it in the queue, marked with which recipients still need a copy, and exits. The "root" sendmail on allspice checks the queue periodically and (eventually) either delivers to the remaining recipients or bounces the mail back to the sender.

At any rate, I figured out why we're getting this sudden rash of problems: it's our friend the select() bug. I even watched it happen: sendmail tries to do a connect() to the mailer on a down host, which panics because of the select() bug (see the appended stack backtrace). Sendmail doesn't clean up, so the message is left locked, and the recipient list isn't updated.

I'm currently running the "root" sendmail on sage, which is running a kernel with Ken's select() fix, and it seems to be dealing with previous problem cases. Unfortunately, this doesn't fix the problem for the entire system. We can

  (1) limp along until the next kernel install, manually unjamming the mail queue as necessary
  (2) hack sendmail or libc to recover when connect() fails
  (3) reconfigure sendmail so that only the "root" sendmail actually delivers mail.

What do people think is the right thing to do? When is the next kernel install planned?

mike
--
#0 0x24ea0 in Sig_Send ()
#1 0x25b38 in panic ()
#2 0x23bb4 in shutdown ()
#3 0x22f1c in connect ()
#4 0x4294 in makeconnection (...) (...)
#5 0x53dc in openmailer (...) (...)
#6 0x12960 in smtpinit (...) (...)
#7 0x4db4 in deliver (...) (...)
#8 0x63ec in sendall (...) (...)
#9 0xd690 in dowork (...) (...)
#10 0xd0b8 in runqueue (...) (...)
#11 0xa7cc in main (...) (...)

Log-Number: 30523
Subject: division by 0 kills Emacs
Date: Thu, 13 Dec 90 21:17:36 PST
From: Mike Kupfer <kupfer>

(/ 0 0) puts Emacs into the debugger. It should generate a complaint about an arithmetic error.
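
For the select() failure discussed in Log-Numbers 30520 and 30531, here is a minimal sketch in C of a caller that treats "zero descriptors ready" as an ordinary timeout instead of a fatal condition. It assumes POSIX select() rather than Sprite's Fs_Select, and the names wait_readable and wait_result are hypothetical; it is not the actual libc or kernel fix.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical result codes for this sketch; these are not Sprite's. */
enum wait_result { WAIT_ERROR = -1, WAIT_TIMEOUT = 0, WAIT_READY = 1 };

/*
 * Wait until fd is readable or timeout_ms milliseconds pass.
 * POSIX select() returns 0 when the timeout expires, so a zero
 * return is reported to the caller as WAIT_TIMEOUT rather than
 * being treated as an impossible state.
 * Assumes 0 <= fd < FD_SETSIZE.
 */
enum wait_result wait_readable(int fd, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int n;

    do {
        /* Rebuild the set and timeout on every attempt; select() may
         * modify both. A retry after EINTR restarts the full timeout. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        n = select(fd + 1, &readfds, NULL, NULL, &tv);
    } while (n < 0 && errno == EINTR);

    if (n < 0)
        return WAIT_ERROR;      /* errno describes the failure */
    if (n == 0)
        return WAIT_TIMEOUT;    /* nothing ready: let the caller clean up */
    return WAIT_READY;
}
```

A caller in the position of sendmail's connect path could then release its queue lock and return an error on WAIT_TIMEOUT, which is roughly what option (2) above asks for.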
Log-Number: 30529
From: mendel (Mendel Rosenblum)
Subject: Re: xmh dies from not tracking directories
Date: Fri, 14 Dec 90 14:06:53 PST

> Return-Path: kupfer
> Received: by sprite.Berkeley.EDU (5.59/1.29)
>     id AA336179; Fri, 14 Dec 90 13:35:04 PST
> Message-Id: <9012142135.AA336179@sprite.Berkeley.EDU>
> To: bugs
> Subject: xmh dies from not tracking directories
> Date: Fri, 14 Dec 90 13:35:03 PST
> From: Mike Kupfer <kupfer>
>
> xmh randomly dies with messages like
>
> xmh: Error in FOpenAndCheck(/users/kupfer/Mail/drafts/.xmhcache, r)
> errno = 2; no such file or directory
> exiting.
>
> This usually happens when I bring up a new composition window. (This
> past time it happened when I brought up a window to compose a new
> message, then brought up a second window to compose a reply.)
>
> I assume this bug is related to xmh's failure to notice changes caused
> by external sources (e.g., reading mail at home via Emacs).
>
> mike

I've seen this bug reported from people running on Unix. This means it is probably not a Sprite bug.

Mendel

Log-Number: 30530
Subject: dump doesn't fail gracefully
Date: Fri, 14 Dec 90 15:54:26 PST
From: Mike Kupfer <kupfer>

The daily dumps failed last night, apparently due to a media error. Unfortunately, they didn't fail cleanly--new dumps tried to start up and then hung. Doing ls on the Exabyte hangs, too. Creating a new file for the same device doesn't work, so I guess the hang is at a fairly low level in the system.
Here's the message from allspice's syslog:

Warning: Exabyte 8200 at SCSI3#2#2 Target 5 LUN 0 error: hardware error - info bytes 0x0 0x0 0x0 0xf8
Warning: Exabyte Tape Motion error
Warning: Exabyte 8200 at SCSI3#2#2 Target 5 LUN 0 error: media error - info bytes 0x0 0x0 0x0 0xf8
Exabyte File Mark Error

The messages in the dump log are

tar.gnu: can't write to - : I/O error
line = 651
Received SIGPIPE signal, terminating abnormally
SIGPIPE: tar exited with code = 0x3
Dump: tar exited with nozero status: 3: invalid argument
Dump: Received SIGPIPE signal, terminating abnormally: I/O error
opening /dev/exb1.nr as archive file
rewinding tape ... done rewinding tape.
reading tape label
rewinding tape ... done rewinding tape.
Using tape #54
TapeLabel=|SPRITE DUMP TAPE #54
(etc.)

Log-Number: 30539
Date: Sun, 16 Dec 90 13:39:35 PST
From: shirriff (Ken Shirriff)
Subject: /tmp problems

When I compile I get:

cc: Error: Can't create output file: /tmp/ctmpa68196 : No such file or directory

This seems to only happen with migration (pmake -X works). I think we had this problem before and it was a locked handle in /tmp, but I don't remember how it was fixed.

Ken

Log-Number: 30540
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Sun, 16 Dec 1990 13:44:07 PST
Subject: Re: /tmp problems

This is probably related to the fact that /tmp is no longer a remote link, but a directory. Perhaps some of the machines out there are confused. I'll try and delete the prefix on all up hosts.

John

Log-Number: 30541
Date: Sun, 16 Dec 90 16:48:31 PST
From: tve (Thorsten von Eicken)
Subject: nfsmount

/home/gingre/sprite was in DEBUG

Log-Number: 30542
Subject: fenugreek died with a deadlock
Date: Sun, 16 Dec 90 21:45:18 PST
From: Mike Kupfer <kupfer>

I found fenugreek in the monitor. It apparently went into the debugger, and someone found it with the video off and tried L1-A. I couldn't put it back into the debugger, so all I can tell you is what was on the console. It was running the 1.079 kernel.
Fatal Error: Deadlock!!!(netRouteMutex @ 0xe09caf0)
Holder PC: 0xe05c5c0 Current PC: 0xe05cdf0
Holder PCB @ 0xe258eac Current PCB @ 0xe0c64fc
Error type 47 while syncing disks.
Entering debugger...

mike

Log-Number: 30543
Date: Mon, 17 Dec 90 09:34:01 PST
From: ouster (John Ousterhout)
Subject: Jaywalk reboot

When I came in this morning I had difficulty doing migrated compilations: I kept getting "*** Error code 5" messages that aborted the make. I tracked the problem down to jaywalk: everything migrated to jaywalk was getting this error. I figured jaywalk must still have something stale from the big reboot on Saturday so I put it into the debugger and attempted to reboot it remotely. Unfortunately I gave it the wrong boot string, so it didn't reboot correctly. In any case, this fixed the problems with pmake.

-John-

Log-Number: 30544
Date: Mon, 17 Dec 90 11:21:41 PST
From: mendel (Mendel Rosenblum)
Subject: pmake error message

If you type make or pmake in a directory with no Makefile you get the error message:

jaywalk% make
--- .BEGIN ---
you cannot compile for a ds3100 on this machine
exit 1
*** Error code 1
make: 1 error
jaywalk%

This is kinda confusing.

Mendel

Log-Number: 30545
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Mon, 17 Dec 1990 11:37:42 PST
Subject: allspice crash

Allspice crashed last evening. It would not respond to the console or the network. It was not in the debugger so I couldn't debug it. I had to use the watchdog reset button. I looked at the first register window for addresses that were in the text segment. Here is what I found.

tbr  0xf6007060  reserveSpace
i7   0xf608ef70  FreeIndirectBlock
l3   0xf6007060  reserveSpace
o2   0xf60f94cc  partFreeListHdr
o7   0xf60396fc  FscacheUnlockBlock

The rest of the registers didn't have anything interesting.
John

Log-Number: 30546
Date: Mon, 17 Dec 90 12:00:21 PST
From: dedood (Paul de Dood)
Subject: Finger information

For the past several days I've been trying to run chfn to change my finger information, but I keep getting:

chpass: password file busy -- try again later.

I'm not just being unlucky, am I?

Thanks,
Paul.

Log-Number: 30547
Date: Mon, 17 Dec 90 12:15:33 PST
From: shirriff (Ken Shirriff)
Subject: Re: Finger information

The lock file (/etc/ptmp) for the password file was still around from Tuesday, for some reason. I deleted it. chfn should now work.

Log-Number: 30548
Date: Mon, 17 Dec 90 16:02:18 PST
From: eklee (Edward K. Lee)
Subject: my mail's disappeared!!!

Espionage was having trouble with allspice so I rebooted espionage and when I checked my mail, there was only one message there. I haven't overwritten my mail file in case you want to look at it. (Before rebooting espionage I was trying to read my mail, but it was so slow that I quit (with ^D). Then I L1-A'ed (I forgot to sync beforehand).)

Could someone look at this as soon as possible? I really need my mail.

Thanks,
Ed

P.S. here's what's in my mail file.

-----
>N 1 pattrsn@peppeFrom da Fri Dec 7 18:02 20/773 "Files in lost+found"
& 1
Message 1:
>From pattrsn@pepper.Berkeley.EDU Fri Dec 7 18:02:44 1990
Date: Fri, 7 Dec 90 18:01:29 PST
>From: pattrsn@peppeFrom daemon Mon Dec 17 15:29:55 1990
Date: Mon, 17 Dec 90 15:26:22 PST
>From: root (The Sprite God)
To: root
Subject: Files in lost+found

You have files in the following lost+found directories. These files were recovered during reboot. Please examine the following directories and recover or delete your files.

//lost+found

&
----

Log-Number: 30554
Date: Mon, 17 Dec 90 19:12:31 PST
From: shirriff (Ken Shirriff)
Subject: Re: my mail's disappeared!!!

I modified the mail program (Mail) to do an fsync() after rewriting the user's mail file. This should help prevent Ed's problem from happening again.
Ken

Log-Number: 30551
From: mendel (Mendel Rosenblum)
Subject: Allspice crash report
Date: Mon, 17 Dec 90 17:35:18 PST

Allspice hung up this afternoon and wouldn't respond to any external stimulus short of the watchdog reset button. The last console messages before the crash involved "Reinit recv unit"s and recovery. The machine was being pounded by a 200 megabyte process on treason. At the watchdog reset, the PC was at:

0xf6039130 <Fscache_FetchBlock>: save %sp, -128, %sp

and the last several stack frames looked like:

0xf608edb8 <FetchIndirectBlock+504>: call 0xf6039130 <Fscache_FetchBlock>
0xf608eabc <MakePtrAccessible+92>: call 0xf608ebc0 <FetchIndirectBlock>
0xf608e6ec <OfsGetFirstIndex+284>: call 0xf608ea60 <MakePtrAccessible>
0xf608a0f0 <Ofs_BlockAllocate+264>: call 0xf608e5d0 <OfsGetFirstIndex>
0xf6038290 <Fsdm_BlockAllocate+168>: mov %l6, %o5
0xf603dd6c <Fscache_Write+484>: call %l0
0xf6044d3c <Fsio_FileWrite+404>: call 0xf603db88 <Fscache_Write>

The last trap in the TBR registers was 0x050 or window overflow. It looked as if it was stuck in an infinite window overflow loop. Next time this happens, the person looking at it should record the last stack pointer %o6 or %sp values.

Mendel

ps. Lest we think the problem has disappeared, /mic got a SCSI bus DMA error during fscheck's read of a descriptor block.

Log-Number: 30552
Subject: tcsh ^D depends on command line?
Date: Mon, 17 Dec 90 18:16:13 PST
From: Mike Kupfer <kupfer>

If I type ^D to tcsh, to ask it for the possible file name completions, the answer I get back depends on what I typed previously in the line. This seems wrong, and it's unlike other shells I've used that have file name completion.

mike
--
sage% cd /sprite/src
sage% foreach d ( a^D
a2p      ali      anno      ar.new  asplosstat  atrm
addhost  alias    appres    ar.old  at          awk
aid      aliases  aquarium  as      atobm
alarm    alloc    ar        as.old  atq
sage% ls a^D
admin/ adobecmds/ attcmds/

Log-Number: 30553
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Mon, 17 Dec 1990 18:19:57 PST
Subject: Re: tcsh ^D depends on command line?

tcsh also does command completion, which is what you are seeing in the first example. You can argue that it shouldn't be doing command completion inside of a list like that, but perhaps they want it to work that way.

John

Log-Number: 30556
From: mendel (Mendel Rosenblum)
Subject: Anise crashed with RpcDaemon bug
Date: Tue, 18 Dec 90 16:56:06 PST

Anise crashed when it tried to reinsert the RpcDaemon's timeout queue entry on the callback list. This bug has been fixed.

Thanks,
Mendel

Log-Number: 30559
Date: Wed, 19 Dec 90 08:48:30 PST
From: ouster (John Ousterhout)
Subject: Anise crash

When I came in this morning Anise was in the debugger with the message "HandleRelease, handle <1,42,2,836457> "bit" not locked". I rebooted it, although it occurs to me in retrospect that I probably could have just continued it. If there had been instructions on the machine for how to take a core dump with kgcore I would have done it, but there weren't so I didn't.

Given the advent of the holiday season and the disappearance of many of the Sprite maintainers, how about updating the instructions on both Anise and Allspice and adding instructions to Assault, if there aren't any there already? Mendel, since you're the author of kgcore, can you take care of this? Thanks.
-John-

Log-Number: 30560
From: mendel (Mendel Rosenblum)
Subject: Re: rcp bug report
Date: Wed, 19 Dec 90 10:02:22 PST

> Return-Path: krste@ICSI.Berkeley.EDU
> Received: from icsib13.Berkeley.EDU by sprite.Berkeley.EDU (5.59/1.29)
>     id AA659016; Wed, 19 Dec 90 08:46:17 PST
> Received: by icsib13.Berkeley.EDU (4.1/SMI-4.0)
>     id AA08380; Wed, 19 Dec 90 02:07:19 PST
> Date: Wed, 19 Dec 90 02:07:19 PST
> From: krste@ICSI.Berkeley.EDU ( Krste Asanovic)
> Message-Id: <9012191007.AA08380@icsib13.Berkeley.EDU>
> To: root@sprite.Berkeley.EDU
> Subject: rcp bug report
>
> The following command line causes rcp to give a segment violation
>
> rcp icsib13:somefile ::
>
> Sometimes it faults straight away, other times if you suspend it, then
> background it, it dies as well.
>
> Krste
> P.S. I use tcsh.

The segment fault was caused by the "rcp" program doing a strlen() call on a NULL pointer. This "works" on a VAX running BSD because the address 0 is readable and contains zero. This doesn't work on most other systems because it is a stupid idea to have NULL accessible. I patched rcp and reinstalled it so it doesn't crash anymore.

Thanks for the bug report. In the future, you might consider sending bug reports like this one to "bugs@sprite" rather than "root@sprite". Mail to the "bugs" alias is less likely to be ignored because it is logged and discussed at every Sprite meeting.

Mendel

ps. I looked on okeefe and this has been fixed in the 4.4 source tree.

Log-Number: 30563
From: mendel (Mendel Rosenblum)
Subject: Re: msgs problem
Date: Wed, 19 Dec 90 16:07:49 PST

> Return-Path: bmiller
> Received: by sprite.Berkeley.EDU (5.59/1.29)
>     id AA340530; Wed, 19 Dec 90 15:51:28 PST
> Date: Wed, 19 Dec 90 15:51:28 PST
> From: bmiller (Bob Miller)
> Message-Id: <9012192351.AA340530@sprite.Berkeley.EDU>
> To: bugs
> Subject: msgs problem
>
>
> I'm having a problem with msgs...I can get the heading information, but
> cannot access the actual message.
> It just goes on to the next message
> heading. Any thoughts???????????????

The problem is that seeking a file to the current offset plus 0 doesn't work when the file is on a pseudo file system and the program is running on a decStation. In C that is:

    lseek(fd, 0, L_INCR)

always returns -1 with errno set to invalid argument. Note that this only happens when using an offset of 0 and when running on a decStation.

Bob, until we get this fixed you can read msgs from any non-DEC machine such as allspice or any sparcStation.

Mendel

Log-Number: 30565
Date: Sun, 23 Dec 90 00:04:04 PST
From: tve (Thorsten von Eicken)
Subject: cc -V broken on ds3100

cc -V is supposed to print version and command line flag info. It uses /usr/ucb/what which we don't have on sprite. Copying the binary from dill doesn't work for some obscure reason.

TvE
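
For the msgs report in Log-Number 30563, a small probe along the following lines could be used to check whether a zero-byte relative seek works on a given descriptor. This is only a sketch: it uses the POSIX constant SEEK_CUR, which has the same value as the older BSD L_INCR, and the function name probe_relative_seek is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask for the current offset of fd without moving it, i.e. the
 * lseek(fd, 0, L_INCR) call from the report, spelled with SEEK_CUR.
 * Returns 0 on success and stores the offset in *offset_out (if
 * offset_out is not NULL); otherwise returns the errno value set
 * by lseek().
 */
int probe_relative_seek(int fd, off_t *offset_out)
{
    off_t off = lseek(fd, (off_t)0, SEEK_CUR);

    if (off == (off_t)-1)
        return errno;
    if (offset_out != NULL)
        *offset_out = off;
    return 0;
}
```

On a regular file the probe should return 0 and the current position. A return of EINVAL for a file that is otherwise seekable would match the behaviour described above.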